Fetching the <td> element using Beautiful Soup - python

I am able to fetch the TD element with class list_selected using Beautiful Soup:
soup.find_all(class_ = {"list_selected"})
I have to fetch the NAME part after that. There are a number of similar blocks.
<tr>
<td align="left" style="padding-left: 3px;padding-right:3px;" class="list_selected">1422</td>
<td align="left" style="padding-left: 3px;padding-right:3px;" class="data">123456</td>
<td align="left" style="padding-left: 3px;padding-right:3px;" class="data">NAME</td>
</tr>

soup.find_all("td", { "class" : "list_selected" })
This will fetch the td nodes for you. The result is a list of nodes according to the documentation.

Beautiful Soup has an attribute called .text that returns the inner text of an element.
I corrected your code below to get the inner text:
from bs4 import BeautifulSoup
soup1 = BeautifulSoup('<td align="left" style="padding-left:3px;padding-right:3px;" class="list_selected">1422</td>',"lxml")
second = soup1.find("td", {"class": "list_selected"})  # find the td with class list_selected
name = second.text  # inner text of that td
print(name)  # displays the inner text: 1422
Hope you got everything correct :)
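The answers above stop at the list_selected cell. To reach the NAME cell the question asks about, one possible approach is to walk from that cell to its following siblings. This is only a sketch: the HTML is the row from the question with the style attributes dropped and a wrapping table added, the variable names are illustrative, and taking the last sibling assumes NAME is always the final cell of the row.

```python
from bs4 import BeautifulSoup

# Row from the question, trimmed; the <table> wrapper is added for well-formedness.
html = """
<table>
<tr>
<td class="list_selected">1422</td>
<td class="data">123456</td>
<td class="data">NAME</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

names = []
# One list_selected cell per block; handle every block on the page.
for cell in soup.find_all("td", class_="list_selected"):
    # All later <td class="data"> cells in the same row.
    siblings = cell.find_next_siblings("td", class_="data")
    if siblings:
        # Assumption: NAME is the last cell of the row.
        names.append(siblings[-1].get_text(strip=True))

print(names)
```

If NAME is not always last, indexing the sibling list by position (for example siblings[1]) would be the equivalent choice.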

Related

how to extract the text from the following HTML code?

I am doing web scraping for a DS project, and I am using BeautifulSoup for that. But I am unable to extract the Duration from the "tbody" tag in the table with class "table".
The HTML code is as follows:
<div class="table-responsive">
<table class="table">
<thead>
<tr>
<th>Start Date</th>
<th>Duration</th>
<th>Stipend</th>
<th>Posted On</th>
<th>Apply By</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<div id="start-date-first">Immediately</div>
</td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
1500 /month
</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr>
</tbody>
</table>
</div>
Note: to extract the 'Immediately' text, I use the following code:
x = container.find("div", {"class" : "table-responsive"})
x.table.tbody.tr.td.div.text
You can use the select() function to find tags by CSS selector.
tds = container.select('div > table > tbody > tr > td')
# or just select('td'), since there's no other td tag
print(tds[1].text)
select() returns a list of all the HTML tags that match the selector. The one you want is the second one, so take index 1 and read its text.
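As a variation on indexing by position, the cells can be paired with the column headers so that Duration is looked up by name. This is a sketch under assumptions: the HTML is a shortened copy of the table from the question, and the names headers and rows are illustrative. It also assumes every body row has the same number of cells as the header row.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question.
html = """
<div class="table-responsive">
<table class="table">
<thead><tr><th>Start Date</th><th>Duration</th><th>Stipend</th></tr></thead>
<tbody>
<tr>
<td><div id="start-date-first">Immediately</div></td>
<td>1 Month</td>
<td class="stipend_container_table_cell">1500 /month</td>
</tr>
</tbody>
</table>
</div>
"""

container = BeautifulSoup(html, "html.parser")

# Column names from the header row.
headers = [th.get_text(strip=True) for th in container.select("thead th")]

rows = []
for tr in container.select("tbody > tr"):
    cells = [td.get_text(strip=True) for td in tr.select("td")]
    # Pair each cell with its column name.
    rows.append(dict(zip(headers, cells)))

duration = rows[0]["Duration"]
print(duration)
```

Looking up by header name keeps working if the site reorders its columns, which a fixed index would not.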
Try this:
from bs4 import BeautifulSoup
import requests
url = "yourUrlHere"
pageRaw = requests.get(url).text
soup = BeautifulSoup(pageRaw , 'lxml')
print(soup.table)
In my code I use the lxml library to parse the data. If you do not have it, install it with pip install lxml, or change the parser to the library you use in this part of the code:
soup = BeautifulSoup(pageRaw , 'lxml')
This code will return the first table on the page.
Take care

Scrape data link and name informations with beautiful soup inside a python nested loop

I'm trying to scrape data from a website.
The HTML structure looks like this:
<tbody>
<tr id="city_1">
<td class="first">Name_1</td>
<td style="text-align: right;"><span class="text">247 380</span></td>
<td class="hidden-xs"><span class="text">NRW</span></td>
<td class="hidden-xs last"><span class="text">52062</span></td>
</tr>
<tr id="city_1">
<td class="first">Name_2</td>
<td style="text-align: right;"><span class="text">247 380</span></td>
<td class="hidden-xs"><span class="text">NRW</span></td>
<td class="hidden-xs last"><span class="text">52062</span></td>
</tr>
</tbody>
I created a nested loop in Python with the Beautiful Soup package to access the hyperlinks that store the information I need (the link and the name).
My code is below:
import pandas as pd
import requests
from bs4 import BeautifulSoup
#get all the city links of the page
page = requests.get("link")
#print(page)
soup = BeautifulSoup(page.content, "html.parser")
#print(soup)
for x in soup.tbody:
    for y in x:
        for z in y:
            print(z.find('a')) #here the problem.
I don't know how to get the href and the name for every hyperlink in the list.
Try this:
for x in soup.tbody.find_all('td',class_='first'):
    print(x.find('a').get('href'),x.text)
Output:
http://www.aachen.de/ Aachen
http://www.aalen.de/ Aalen
http://www.amberg.de/ Amberg
etc.
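The loop above raises an AttributeError if any td.first cell has no link. A slightly more defensive sketch follows. The HTML here is an assumption: the snippet in the question shows no <a> tags, so the markup below guesses at their position from the output above, and the middle row without a link is invented to exercise the guard.

```python
from bs4 import BeautifulSoup

# Assumed markup: an <a> inside each td.first, as the output above implies.
html = """
<table><tbody>
<tr><td class="first"><a href="http://www.aachen.de/">Aachen</a></td></tr>
<tr><td class="first">No link in this row</td></tr>
<tr><td class="first"><a href="http://www.aalen.de/">Aalen</a></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

cities = []
for cell in soup.select("td.first"):
    # href=True only matches <a> tags that actually carry an href.
    link = cell.find("a", href=True)
    if link is None:
        continue  # skip cells without a usable link
    cities.append((link["href"], link.get_text(strip=True)))

print(cities)
```

Collecting (href, name) tuples instead of printing makes it easy to pass the result to pandas.DataFrame later, since the question already imports pandas.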

Selenium table search not returning correct text

I'm currently learning how to use Selenium in Python. I have a table and I want to retrieve an element from it, but I'm running into some trouble.
<table class="table" id="SearchTable">
<thead>..</thead>
<tfoot>..</tfoot>
<tbody>
<tr>
<td class="icon">..</td>
<td class="title">
<a class="qtooltip">
<b>I want to get the text here</b>
</a>
</td>
</tr>
<tr>
<td class="icon">..</td>
<td class="title">
<a class="qtooltip">
<b>I want to get the text here as well</b>
</a>
</td>
</tr>
</table>
Inside this table, I want to access the text in the bold tag, but my program isn't returning the correct number of tr elements. In fact, I'm not even sure it's searching the correct elements.
I have backtracked my problem from the end text and found that the errors started appearing from the line with comment. (I think the code afterwards is wrong as well but I'm focusing on getting the correct table row first)
My code is:
search_table = driver.find_element_by_id("SearchTable")
search_table_body = search_table.find_element(By.TAG_NAME, "tbody")
trs = search_table_body.find_elements(By.TAG_NAME, "tr")
print(trs) # this does not return correct number of tr
for tr in trs:
    tds = tr.find_elements(By.TAG_NAME, "td")
    for td in tds:
        href = td.find_element_by_class_name("qtooltip")
        print(href.get_attribute("innerHtml"))
I'm supposed to get the correct number of tr count so I can return the text in the anchor tag but I am stuck. Any help is appreciated. Thanks!
You can get all <b> tags that are children of an <a> tag with a class attribute of qtooltip inside the table using a single XPath selector:
//table/descendant::a[@class='qtooltip']/b
Example code:
from selenium.webdriver.common.by import By
elements = driver.find_elements(By.XPATH, "//table/descendant::a[@class='qtooltip']/b")
for element in elements:
    print(element.text)
References:
XPath Tutorial
XPath Axes
XPath Operators & Functions
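The predicate syntax of the selector can be checked without a browser. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in, which is an assumption of convenience: it implements only a subset of XPath and needs well-formed markup, so the table from the question is reproduced with thead and tfoot left out and the missing </tbody> added. In that subset, ".//" plays the role of descendant::.

```python
import xml.etree.ElementTree as ET

# Table from the question, made well-formed for an XML parser.
markup = """
<table class="table" id="SearchTable">
<tbody>
<tr>
<td class="icon">..</td>
<td class="title"><a class="qtooltip"><b>I want to get the text here</b></a></td>
</tr>
<tr>
<td class="icon">..</td>
<td class="title"><a class="qtooltip"><b>I want to get the text here as well</b></a></td>
</tr>
</tbody>
</table>
""".strip()

root = ET.fromstring(markup)

# Every <b> that is a direct child of <a class="qtooltip">, at any depth.
texts = [b.text for b in root.findall(".//a[@class='qtooltip']/b")]

print(texts)
```

Two matches for two rows confirms that the attribute test is written with @class; the same predicate then carries over to the Selenium call.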

Parsing html in with BeautifulSoup fails to find a table

I am trying to parse the data in this website:
http://www.baseball-reference.com/boxes/CHN/CHN201606020.shtml
I want to extract some of the data in the tables. But for some reason, I am struggling to find them. For example, what I want to do is this
from bs4 import BeautifulSoup
import requests
url = 'http://www.baseball-reference.com/boxes/CHN/CHN201606020.shtml'
soup = BeautifulSoup(requests.get(url).text)
soup.find('table', id='ChicagoCubsbatting')
The final line returns nothing despite a table with that id existing in the html. Furthermore, len(soup.findAll('table')) returns 1 even though there are many tables in the page. I've tried using the 'lxml', 'html.parser' and 'html5lib'. All behave the same way.
What is going on? Why does this not work and what can I do to extract the table?
The table markup sits inside an HTML comment, so the parser does not see it as tags. Use soup.find('div', class_='placeholder').next_sibling.next_sibling to get the comment text, then build a new soup from that text.
In [35]: new_soup = BeautifulSoup(text, 'lxml')
In [36]: new_soup.table
Out[36]:
<table class="teams poptip" data-tip="San Francisco Giants at Atlanta Braves">
<tbody>
<tr class="winner">
<td>SFG</td>
<td class="right">6</td>
<td class="right gamelink">
Final
</td>
</tr>
<tr class="loser">
<td>ATL</td>
<td class="right">0</td>
<td class="right">
</td>
</tr>
</tbody>
</table>
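A more general way to reach markup hidden in comments, without relying on the position of the placeholder div, is to search for Comment nodes and re-parse each one. This is a sketch, not the page's real markup: the HTML below is a tiny stand-in with a made-up cell, and only the table id is taken from the question.

```python
from bs4 import BeautifulSoup, Comment

# Stand-in for the real page: the table exists only inside a comment.
html = """
<div class="placeholder"></div>
<!--
<table id="ChicagoCubsbatting">
<tr><td>placeholder cell</td></tr>
</table>
-->
"""

soup = BeautifulSoup(html, "html.parser")

# A direct search misses the table because it is comment text, not tags.
visible = soup.find("table", id="ChicagoCubsbatting")

table = None
# Walk every comment node and parse its text as HTML in its own right.
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    inner = BeautifulSoup(str(comment), "html.parser")
    table = inner.find("table", id="ChicagoCubsbatting")
    if table is not None:
        break

print(visible, table is not None)
```

This also explains why switching between 'lxml', 'html.parser' and 'html5lib' made no difference: all of them treat comment contents as plain text.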

Get every second URL from HTML in Python

I am struggling with scraping URLs from a website. The HTML code from the website I want to scrape is:
<tr>
<td>
<span>
<table class="search-result-ad-row" cellspacing="3" border="0">
<tbody>
<tr>
<td class="picture" rowspan="2"><a title="3.izbový byt v starom meste na ulici Kpt. Nálepku" href="inzerat/RE0005055-16-000281/3-izbovy-byt-v-starom-meste-na-ulici-kpt-nalepku"><img src="/data/189/RE0005055/ads/195/RE0005055-16-000281/img/thum/37587134.jpeg" alt=""/></a>
</td>
<td class="title" colspan="2"><a title="3.izbový byt v starom meste na ulici Kpt. Nálepku" href="inzerat/RE0005055-16-000281/3-izbovy-byt-v-starom-meste-na-ulici-kpt-nalepku"><h2 style="font-size: inherit;">3.izbový byt v starom meste na ulici Kpt. Nálepku</h2></a>
<span></span>
</td>
</tr>
<tr>
I want to get the href by using this python code:
br = mechanize.Browser()
br.open("http://www.reality.sk/")
br.select_form(nr=0)
br["tabs:scrn243:scrn115:errorTooltip.cityName:cityName"]="poprad"
br.submit()
def hello():
    soup = BeautifulSoup(br.response().read())
    for link in soup.findAll('a'):
        link2 = link.get('href')
        if "inzerat/" in link2:
            print 'http://www.reality.sk/' + link.get('href')
But the problem is I get 2 results for each URL (because there are 2 href attributes). I have tried to scrape using the table tag, the td tag with a class attribute (either "picture" or "title") or even using rowspan (=2). But I am not getting the desired result. I don't know how to make code work.
I guess you had issues with looking up by the class selector. Also you can chain the tags returned by find - please take a look if this solution helps (I'm not 100% sure if that's what you want to achieve):
soup.find_all('table', class_='search-result-ad-row')
for ad_table in soup.find_all('table', class_='search-result-ad-row'):
    # one href per ad: take the link from the "picture" cell only
    print(ad_table.find(class_='picture').find('a').attrs['href'])
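An alternative sketch for getting one URL per ad uses a CSS selector on the "title" cell and builds absolute URLs with urljoin. It is written for Python 3, unlike the Python 2 code in the question. The HTML is a trimmed copy of the row from the question (the img src and heading text are shortened), and the base URL is the one the question prepends by hand.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = "http://www.reality.sk/"

# Trimmed copy of one ad row; both cells link to the same ad.
html = """
<table class="search-result-ad-row">
<tbody><tr>
<td class="picture"><a href="inzerat/RE0005055-16-000281/3-izbovy-byt-v-starom-meste-na-ulici-kpt-nalepku"><img src="thumb.jpeg" alt=""/></a></td>
<td class="title"><a href="inzerat/RE0005055-16-000281/3-izbovy-byt-v-starom-meste-na-ulici-kpt-nalepku"><h2>3.izbový byt</h2></a></td>
</tr></tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

urls = []
# Only links in the "title" cell, so each ad contributes a single href.
for a in soup.select("table.search-result-ad-row td.title a[href]"):
    url = urljoin(BASE, a["href"])
    if url not in urls:  # guard against repeated ads
        urls.append(url)

print(urls)
```

Restricting the selector to one cell removes the duplicate at the source, and the membership check only matters if the same ad appears twice on a page.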
