Parsing using BeautifulSoup - python

Is it possible to extract the content that comes after the text Final Text: (plain text, not a tag) using BeautifulSoup?
I.e., I expect only:
<td>0 / 22 FAIL</td></tr><tr>
The problem is that many tags don't have a class or id. If I extract every <td>, I get all of them, which is not what I need.
<td><strong>Final Text:</strong></td>
<td>0 / 22 FAIL</td></tr><tr>
<td><strong>Ext:</strong></td>
<td>343 / 378 FAIL</td></tr></table>

You can find the <strong>Final Text:</strong> tag using find('strong', text='Final Text:'). Then, you can use the find_next() method to get the next <td> tag.
from bs4 import BeautifulSoup

html = '''
<table>
<tr>
<td><strong>Final Text:</strong></td>
<td>0 / 22 FAIL</td>
</tr>
<tr>
<td><strong>Ext:</strong></td>
<td>343 / 378 FAIL</td>
</tr>
</table>
'''
soup = BeautifulSoup(html, 'lxml')
txt = soup.find('strong', text='Final Text:').find_next('td')
print(txt)
Output:
<td>0 / 22 FAIL</td>
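If the label might be missing from some pages, the lookup can be wrapped in a small helper that returns None instead of raising. This is only a sketch: value_after_label is a hypothetical name, string= is the newer spelling of the text= argument, and html.parser is used so nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><td><strong>Final Text:</strong></td><td>0 / 22 FAIL</td></tr>
<tr><td><strong>Ext:</strong></td><td>343 / 378 FAIL</td></tr>
</table>
"""


def value_after_label(soup, label):
    """Return the text of the first <td> after the <strong> whose text is `label`.

    Hypothetical helper: returns None when the label or the next cell is absent.
    """
    strong = soup.find("strong", string=label)  # exact match on the tag's text
    if strong is None:
        return None
    cell = strong.find_next("td")  # next <td> in document order
    return cell.get_text(strip=True) if cell is not None else None


soup = BeautifulSoup(html, "html.parser")
print(value_after_label(soup, "Final Text:"))
print(value_after_label(soup, "Missing:"))
```

The None check matters because find() returns None when nothing matches, and chaining .find_next() onto that would raise an AttributeError.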

Yes, it's possible. Consider this HTML:
<table>
<tr>
<td><strong>Final Text:</strong></td>
<td>0 / 22 FAIL</td>
</tr>
<tr>
<td><strong>Ext:</strong></td>
<td>343 / 378 FAIL</td>
</tr>
</table>
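This XPath will work (BeautifulSoup itself cannot evaluate XPath, so run it with a library such as lxml):
//*[contains(text(),'Final Text')]/parent::td/following-sibling::td[1]
Find the tag containing the text Final Text, get its parent td, then take that td's first following sibling td. The value sits in the same row as the label, so stepping up to the tr and on to the next tr would return the Ext row instead.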
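A minimal sketch of evaluating an XPath like this with lxml, since BeautifulSoup does not evaluate XPath itself. It assumes the label cell and the value cell share a row, as in the HTML above, so it steps from the label's td to the first following sibling td. The variable names are illustrative only.

```python
from lxml import etree

html = """
<table>
<tr>
<td><strong>Final Text:</strong></td>
<td>0 / 22 FAIL</td>
</tr>
<tr>
<td><strong>Ext:</strong></td>
<td>343 / 378 FAIL</td>
</tr>
</table>
"""

# Parse with lxml's lenient HTML parser.
tree = etree.fromstring(html, etree.HTMLParser())

# Label tag -> its parent td -> first td after it in the same row.
cells = tree.xpath(
    "//strong[contains(text(), 'Final Text')]/parent::td/following-sibling::td[1]"
)
values = [cell.text for cell in cells]
print(values)
```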

If the content you're trying to get always sits in the second <td>, why not take the second element of the list of <td> elements?
soup = BeautifulSoup(html, 'html.parser')
td_list = soup.find_all('td')  # find() returns a single tag; find_all() returns a list
td_list[1]  # This would be the FAIL element

Related

How do I get an item but only if it is a sibling of a certain tag

I have a long html but here's a fragment:
<tr>
<td data-bind="text:name, css: isActive() ? 'variable-active': 'variable-inactive'" class="variable-active">Vehicle</td>
<td data-bind="text:value">Ford</td>
</tr>
<tr>
<td data-bind="text:name, css: isActive() ? 'variable-active': 'variable-inactive'" class="variable-inactive">Model</td>
<td data-bind="text:value">Focus</td>
</tr>
I want to find every name cell whose class is "variable-active" and then get the value from the next 'td' tag. In this case, since the second name cell has the class "variable-inactive", the output should be:
"Vehicle - Ford"
I managed to get the first tags based on "variable-active", but I can't get the values from the tags that follow them. This is my code:
from bs4 import BeautifulSoup
with open("html.html", "r") as f:
    doc = BeautifulSoup(f, "html.parser")
tag = doc.findAll("tr")[0]
print(tag.findAll(class_="variable-active")[0].contents[0])  # Vehicle
tag.findNextSibling(class_="variable-active")  # nothing
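You want to structure your search a little differently. Calling findNextSibling on the tr looks at the tr's own siblings (the other rows), not at the cells inside it, so find the active cell first and then ask for that cell's next sibling: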
tag = soup.findAll("tr")[0]
tag1 = tag.find(class_="variable-active") # <-- use .find
tag2 = tag1.findNextSibling() # <-- use tag1.findNextSibling() to find next sibling tag
print(tag1.text) # <-- use .text to get all text from tag
print(tag2.text)
Prints:
Vehicle
Ford
Another version using CSS selectors:
data = soup.select(".variable-active, .variable-active + *")
print(" - ".join(d.text for d in data))
Prints:
Vehicle - Ford
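To handle a whole table rather than only the first row, one option is to loop over every active name cell and pair it with its next sibling cell. A rough sketch, assuming the simplified HTML below (the data-bind attributes are dropped for brevity) and illustrative variable names:

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr>
<td class="variable-active">Vehicle</td>
<td>Ford</td>
</tr>
<tr>
<td class="variable-inactive">Model</td>
<td>Focus</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
# class_ matches whole class names, so "variable-inactive" cells are skipped.
for name_cell in soup.find_all("td", class_="variable-active"):
    value_cell = name_cell.find_next_sibling("td")
    if value_cell is not None:  # skip a name cell that has no value cell
        name = name_cell.get_text(strip=True)
        value = value_cell.get_text(strip=True)
        pairs.append(f"{name} - {value}")

print(pairs)
```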

Parse integer from a "td" tag with beautifulsoup

I read many articles about beautifulsoup but still I do not understand. I need an example.
I want to get the value of "PD/DD" which is 1,9.
Here is the source:
<div class="table vertical">
<table>
<tbody>
<tr>
<th>F/K</th>
<td>A/D</td>
</tr>
<tr>
<th>FD/FAVÖK</th>
<td>19,7</td>
</tr>
<tr>
HERE--> <th>PD/DD</th>
HERE--> <td>1,9</td>
</tr>
<tr>
<th>FD/Satışlar</th>
<td>5,1</td>
</tr>
<tr>
<th>Yabancı Oranı (%)</th>
<td>2,43</td>
</tr>
<tr>
<th>Ort Hacim (mn$) 3A/12A</th>
<td>1,3 / 1,6</td>
</tr>
My code is:
import requests
from bs4 import BeautifulSoup

a = "afyon"
url_bank = "https://www.isyatirim.com.tr/tr-tr/analiz/hisse/sayfalar/sirket-karti.aspx?hisse={}".format(a.upper())
response_bank = requests.get(url_bank)
html_content_bank = response_bank.content
soup_bank = BeautifulSoup(html_content_bank, "html.parser")
b = soup_bank.find_all("div", {"class": "table vertical"})
for i in b:
    children = i.findChildren("td", recursive=True)
    for child in children:
        l = []
        l_text = child.text
        l.append(l_text)
        print(l)
When I run this code, it prints a separate one-element list for every cell:
['Afyon Çimento ']
['11.04.1990']
['Çimento üretip satmak ve ana faaliyet konusu ile ilgili her türlü yan sanayi kuruluşlarına iştirak etmek.']
['(0216)5547000']
['(0216)6511415']
['Kısıklı Cad. Sarkusyan-Ak İş Merkezi S Blok kat:2 34662 Altunizade - Üsküdar / İstanbul']
['A/D']
['19,7']
['1,9']
['5,1']
['2,43']
['1,3 / 1,6']
['407,0 mnTL']
['395,0 mnTL']
['-']
How can I get only the PD/DD value? I am expecting something like:
PD/DD : 1,9
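My preference:
With bs4 4.7.1 or later you can use :contains to target the th by its text value and then take the adjacent sibling td. Newer Soup Sieve releases spell this pseudo-class :-soup-contains and emit a deprecation warning for :contains.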
import requests
from bs4 import BeautifulSoup
a="afyon"
url_bank = "https://www.isyatirim.com.tr/tr-tr/analiz/hisse/sayfalar/sirket-karti.aspx?hisse={}".format(a.upper())
response_bank = requests.get(url_bank)
html_content_bank = response_bank.content
soup_bank = BeautifulSoup(html_content_bank, "html.parser")
print(soup_bank.select_one('th:contains("PD/DD") + td').text)
You could also use :nth-of-type for positional matching (3rd row 1st column):
soup_bank.select_one('.vertical table:not([class]) tr:nth-of-type(3) td:nth-of-type(1)').text
As we are using select_one, which returns first match, we can shorten to:
soup_bank.select_one('.vertical table:not([class]) tr:nth-of-type(3) td').text
If the id is static:
soup_bank.select_one('#ctl00_ctl45_g_76ae4504_9743_4791_98df_dce2ca95cc0d tr:nth-of-type(3) td').text
You already know the PD/DD label, but it could be retrieved with:
soup_bank.select_one('.vertical table:not([class]) tr:nth-of-type(3) th').text
If those ids remain static for at least a while then
soup_bank.select_one('#ctl00_ctl45_g_76ae4504_9743_4791_98df_dce2ca95cc0d tr:nth-of-type(3) th').text
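If positional selectors feel brittle, another option is to read every th/td pair into a dictionary and look the label up by name. The sketch below runs on a trimmed copy of the HTML from the question rather than the live page. The names ratios and pd_dd are made up for illustration, and the decimal-comma conversion assumes the site always formats numbers like 1,9.

```python
from bs4 import BeautifulSoup

html = """
<div class="table vertical">
<table>
<tbody>
<tr><th>F/K</th><td>A/D</td></tr>
<tr><th>FD/FAVÖK</th><td>19,7</td></tr>
<tr><th>PD/DD</th><td>1,9</td></tr>
<tr><th>FD/Satışlar</th><td>5,1</td></tr>
</tbody>
</table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each row's header text to its value text.
ratios = {}
for row in soup.select("div.vertical tr"):
    th = row.find("th")
    td = row.find("td")
    if th is not None and td is not None:
        ratios[th.get_text(strip=True)] = td.get_text(strip=True)

print(f"PD/DD : {ratios['PD/DD']}")

# Convert the decimal comma to a dot before parsing as a number.
pd_dd = float(ratios["PD/DD"].replace(",", "."))
print(pd_dd)
```

A dictionary also makes it cheap to read the other ratios from the same table without writing one selector per row.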

Find next td based on a td with a span tag in it

How to find next td of a td with a span in it?
html_text = """
<tr class="someClass">
<td> </td>
<td>A normal string</td>
<td class="someClass">10</td>
<td class="someClass">11</td>
<td class="someClass">12</td>
<td> </td>
</tr>
<tr class="someClass">
<td> </td>
<td>Non normal string <span style="font-size:10px">(with span)</span></td>
<td class="someClass">2 000</td>
<td class="someClass">2 100</td>
<td class="someClass">2 150</td>
<td> </td>
</tr>
"""
To get the td after the td with "A normal string" in it, I simply find it with:
a_normal_string = str(soup.find("td", text="A normal string").find_next('td'))
a_normal_string = re.findall(r'\d+', a_normal_string)
print(a_normal_string)  # ['10']
However, in the second tr, where I need the td after the td containing "Non normal string", the above method does not work. How do I deal with a td that contains a span?
My first thought was to match it with a compiled regex, a_nonnormal_string = str(soup.find("td", text=re.compile(r'Non normal string')).find_next('td')), but that does not work either.
This is just an example with two trs; the actual website has hundreds of them.
One option would be to solve it with a searching function, using get_text() to check the text against a desired string (note that get_text() returns the complete text of an element including its child elements, but .string does not - it would be None if there are child elements - this is actually the reason why your second approach does not work):
tds = soup.find_all(lambda tag: tag.name == "td" and "normal string" in tag.get_text())
for td in tds:
    a_normal_string = td.find_next('td').get_text()
    print(a_normal_string)
Prints:
10
2 000
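With hundreds of rows it may be handier to look up one label at a time and get a number back. A sketch along those lines, where number_after is a hypothetical helper and the rows are a shortened copy of the example, wrapped in a table. It matches on the start of the cell's full text, so a cell with a span is handled the same way as a plain one.

```python
import re

from bs4 import BeautifulSoup

html_text = """
<table>
<tr class="someClass">
<td> </td>
<td>A normal string</td>
<td class="someClass">10</td>
<td class="someClass">11</td>
</tr>
<tr class="someClass">
<td> </td>
<td>Non normal string <span style="font-size:10px">(with span)</span></td>
<td class="someClass">2 000</td>
<td class="someClass">2 100</td>
</tr>
</table>
"""


def number_after(soup, label):
    """Return the integer in the <td> right after the <td> whose text starts with `label`.

    Hypothetical helper: returns None when no label cell or no digits are found.
    """
    cell = soup.find(
        lambda tag: tag.name == "td"
        and tag.get_text(strip=True).startswith(label)
    )
    if cell is None:
        return None
    nxt = cell.find_next_sibling("td")
    if nxt is None:
        return None
    digits = re.sub(r"\D", "", nxt.get_text())  # drop spaces such as in "2 000"
    return int(digits) if digits else None


soup = BeautifulSoup(html_text, "html.parser")
print(number_after(soup, "A normal string"))
print(number_after(soup, "Non normal string"))
```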

Using Python + BeautifulSoup to pick up text in a table on webpage

I want to pick up a date on a webpage.
The original webpage source code looks like:
<TR class=odd>
<TD>
<TABLE class=zp>
<TBODY>
<TR>
<TD><SPAN>Expiry Date</SPAN>2016</TD></TR></TBODY></TABLE></TD>
<TD> </TD>
<TD> </TD></TR>
I want to pick up the '2016', but I can't get it. The furthest I got is:
page = urllib2.urlopen('http://www.thewebpage.com')
soup = BeautifulSoup(page.read())
a = soup.find_all(text=re.compile("Expiry Date"))
And I tried:
b = a[0].findNext('').text
print b
and
b = a[0].find_next('td').select('td:nth-of-type(1)')
print b
Neither of them works.
Any help? Thanks.
There are multiple options.
Option #1 (using CSS selector, being very explicit about the path to the element):
from bs4 import BeautifulSoup
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly
span = soup.select('tr.odd table.zp > tbody > tr > td > span')[0]
print(span.next_sibling.strip())  # prints 2016
We are basically saying: get me the span tag that is directly inside the td that is directly inside the tr that is directly inside tbody that is directly inside the table tag with zp class that is inside the tr tag with odd class. Then, we are using next_sibling to get the text after the span tag.
Option #2 (find the span by its text; I think it is more readable):
span = soup.find('span', text=re.compile('Expiry Date'))
print(span.next_sibling.strip())  # prints 2016
re.compile() is needed because the text may span several lines and be surrounded by extra whitespace. Do not forget to import the re module.
An alternative to the css selector is:
import bs4
data = """
<TR class="odd">
<TD>
<TABLE class="zp">
<TBODY>
<TR>
<TD>
<SPAN>
Expiry Date
</SPAN>
2016
</TD>
</TR>
</TBODY>
</TABLE>
</TD>
<TD> </TD>
<TD> </TD>
</TR>
"""
soup = bs4.BeautifulSoup(data, 'html.parser')  # name the parser explicitly
exp_date = soup.find('table', class_='zp').tbody.tr.td.span.next_sibling
print(exp_date.strip())  # 2016
To learn about BeautifulSoup, I recommend you read the documentation.
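The span-based lookups above assume that the span exists and that a text node follows it. A slightly more defensive sketch, where text_after_span is a hypothetical helper name and html.parser is an assumed parser choice, returns None in the other cases:

```python
import re

from bs4 import BeautifulSoup, NavigableString

data = """
<TABLE class="zp">
<TBODY>
<TR>
<TD><SPAN>Expiry Date</SPAN>2016</TD>
</TR>
</TBODY>
</TABLE>
"""


def text_after_span(soup, label):
    """Return the stripped text node that directly follows the <span> containing `label`.

    Hypothetical helper: returns None if the span is missing or is not
    followed by non-empty text.
    """
    span = soup.find("span", string=re.compile(re.escape(label)))
    if span is None:
        return None
    sibling = span.next_sibling
    if isinstance(sibling, NavigableString):
        return sibling.strip() or None
    return None


soup = BeautifulSoup(data, "html.parser")  # html.parser lowercases tag names
expiry = text_after_span(soup, "Expiry Date")
print(expiry)
```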

How to extract 2nd column in html table in python?

<table style="width:300px" border="1">
<tr>
<td>John</td>
<td>Doe</td>
<td>80</td>
</tr>
<tr>
<td>ABC</td>
<td>abcd</td>
<td>80</td>
</tr>
<tr>
<td>EFC</td>
<td>efc</td>
<td>80</td>
</tr>
</table>
I need to grab all the td's in column 2 in Python. I am new to Python.
from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup
url = "http://ccdsiu.byethost33.com/magento/adamo-13.html"
text = urlopen(url).read()
soup = BeautifulSoup(text, "html.parser")
data = soup.findAll('div', attrs={'class': 'madhu'})
for div in data:
    trdata = div.findAll('tr')
    tddata = div.findAll('td')
    for trr in trdata:
        print(trr)
I am trying to get the data with the code above, but it prints all the td elements in the table. I would like to achieve this with XPath.
I don't think you can use XPath like you mentioned with BeautifulSoup. However, the lxml module can do it. It is a third-party package, so install it first (for example with pip install lxml).
from lxml import etree
table = '''
<table style="width:300px" border="1">
<tr>
<td>John</td>
<td>Doe</td>
<td>80</td>
</tr>
<tr>
<td>ABC</td>
<td>abcd</td>
<td>80</td>
</tr>
<tr>
<td>EFC</td>
<td>efc</td>
<td>80</td>
</tr>
</table>
'''
parser = etree.HTMLParser()
tree = etree.fromstring(table, parser)
results = tree.xpath('//tr/td[position()=2]')
print('Column 2\n========')
for r in results:
    print(r.text)
When run, this prints:
Column 2
========
Doe
abcd
efc
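If you would rather stay within BeautifulSoup, a CSS selector can play a similar role to the XPath position test. A small sketch, assuming bs4 4.7 or later (where select() supports :nth-of-type) and a shortened copy of the table. The name column_2 is illustrative.

```python
from bs4 import BeautifulSoup

table = """
<table style="width:300px" border="1">
<tr><td>John</td><td>Doe</td><td>80</td></tr>
<tr><td>ABC</td><td>abcd</td><td>80</td></tr>
<tr><td>EFC</td><td>efc</td><td>80</td></tr>
</table>
"""

soup = BeautifulSoup(table, "html.parser")

# The second <td> child of every row, in document order.
column_2 = [td.get_text(strip=True) for td in soup.select("tr > td:nth-of-type(2)")]
print(column_2)
```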
You don't have to iterate over every td element. Within each row, take the second td:
for div in data:
    for tr in div.findAll('tr'):
        tddata = tr.findAll('td')
        if len(tddata) >= 2:
            print(tddata[1])
Lists are indexed starting from 0, so index 1 is the second cell. I check the length of the list to make sure that a second td exists in the row.
It is not really clear what you want, since your example HTML does not match the page and "just the second-column tds" isn't a very helpful description. Anyway, I modified Elmo's answer to give you the Importance title and then the actual importance level of each item.
for div in data:
    trdata = div.findAll('tr')
    tddata = div.findAll('td')
    count = 0
    for i in range(0, len(tddata)):
        # take the cell after every sixth td, guarding against running off the end
        if count % 6 == 0 and count + 1 < len(tddata):
            print(tddata[count + 1])
        count += 1
