Get every second URL from HTML in Python

I am struggling to scrape URLs from a website. The HTML of the page I want to scrape is:
<tr>
<td>
<span>
<table class="search-result-ad-row" cellspacing="3" border="0">
<tbody>
<tr>
<td class="picture" rowspan="2"><a title="3.izbový byt v starom meste na ulici Kpt. Nálepku" href="inzerat/RE0005055-16-000281/3-izbovy-byt-v-starom-meste-na-ulici-kpt-nalepku"><img src="/data/189/RE0005055/ads/195/RE0005055-16-000281/img/thum/37587134.jpeg" alt=""/></a>
</td>
<td class="title" colspan="2"><a title="3.izbový byt v starom meste na ulici Kpt. Nálepku" href="inzerat/RE0005055-16-000281/3-izbovy-byt-v-starom-meste-na-ulici-kpt-nalepku"><h2 style="font-size: inherit;">3.izbový byt v starom meste na ulici Kpt. Nálepku</h2></a>
<span></span>
</td>
</tr>
<tr>
I want to get the href using this Python code:
import mechanize
from bs4 import BeautifulSoup
br = mechanize.Browser()
br.open("http://www.reality.sk/")
br.select_form(nr=0)
br["tabs:scrn243:scrn115:errorTooltip.cityName:cityName"] = "poprad"
br.submit()
def hello():
    soup = BeautifulSoup(br.response().read(), "html.parser")
    for link in soup.findAll('a'):
        link2 = link.get('href')
        # skip anchors without an href (link2 would be None)
        if link2 and "inzerat/" in link2:
            print('http://www.reality.sk/' + link2)
The problem is that I get two results for each URL, because every ad contains two a tags with the same href. I have tried scraping by the table tag, by the td tag with a class attribute (either "picture" or "title"), and even by rowspan (=2), but none of these gave the desired result. How can I get each URL only once?

I guess you had trouble looking up by class. You can also chain calls on the tags returned by find. Please check whether this solution helps (I'm not 100% sure this is what you want to achieve):
for ad_table in soup.find_all('table', class_='search-result-ad-row'):
    print(ad_table.find(class_='picture').find('a').attrs['href'])
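As a rough sketch of the same idea on a trimmed copy of the markup from the question: restricting the lookup to the "picture" cell yields one href per ad table, and urljoin turns it into an absolute URL. The shortened HTML and the BASE_URL constant are assumptions made for illustration, and the code targets Python 3 with the html.parser backend.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (illustrative only).
html = """
<table class="search-result-ad-row">
<tbody>
<tr>
<td class="picture" rowspan="2"><a title="ad" href="inzerat/RE0005055-16-000281/3-izbovy-byt"><img src="thumb.jpeg" alt=""/></a></td>
<td class="title" colspan="2"><a title="ad" href="inzerat/RE0005055-16-000281/3-izbovy-byt"><h2>3.izbový byt</h2></a></td>
</tr>
</tbody>
</table>
"""

BASE_URL = "http://www.reality.sk/"  # assumed base for the relative hrefs

soup = BeautifulSoup(html, "html.parser")

ad_urls = []
for ad_table in soup.find_all("table", class_="search-result-ad-row"):
    # Look only inside the "picture" cell, so the duplicate link
    # in the "title" cell is never visited.
    picture_cell = ad_table.find("td", class_="picture")
    link = picture_cell.find("a", href=True) if picture_cell else None
    if link is not None:
        ad_urls.append(urljoin(BASE_URL, link["href"]))

print(ad_urls)
```

Guarding against a missing cell or link keeps the loop from raising AttributeError on rows that do not follow the expected layout.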

Related

How to extract the text from the following HTML code?

I am doing web scraping for a data science project using BeautifulSoup, but I am unable to extract the Duration value from the tbody of the table with class "table".
The HTML is as follows:
<div class="table-responsive">
<table class="table">
<thead>
<tr>
<th>Start Date</th>
<th>Duration</th>
<th>Stipend</th>
<th>Posted On</th>
<th>Apply By</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<div id="start-date-first">Immediately</div>
</td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
1500 /month
</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr>
</tbody>
</table>
</div>
Note: to extract the 'Immediately' text, I use the following code:
x = container.find("div", {"class" : "table-responsive"})
x.table.tbody.tr.td.div.text
You can use the select() method to find tags by CSS selector.
tds = container.select('div > table > tbody > tr > td')
# or just select('td'), since there's no other td tag
print(tds[1].text)
select() returns a list of all tags that match the selector. The cell you want is the second one, so take index 1 and read its text.
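A variation on the index-based lookup, offered only as a sketch: pairing the header labels with the body cells lets you ask for "Duration" by name instead of by position. The HTML is a condensed copy of the snippet from the question, and the names headers, cells and row are made up for this example.

```python
from bs4 import BeautifulSoup

# Condensed copy of the table from the question.
html = """
<div class="table-responsive">
<table class="table">
<thead>
<tr><th>Start Date</th><th>Duration</th><th>Stipend</th><th>Posted On</th><th>Apply By</th></tr>
</thead>
<tbody>
<tr>
<td><div id="start-date-first">Immediately</div></td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i> 1500 /month</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr>
</tbody>
</table>
</div>
"""

container = BeautifulSoup(html, "html.parser")

# Column labels from the header row, cell texts from the body row.
headers = [th.get_text(strip=True) for th in container.select("thead th")]
cells = [td.get_text(" ", strip=True) for td in container.select("tbody td")]

# Map each label to its cell (assumes a single body row).
row = dict(zip(headers, cells))

print(row["Duration"])
```

This only works as written for a single body row; with several rows you would build one dictionary per tr.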
Try this:
from bs4 import BeautifulSoup
import requests
url = "yourUrlHere"
pageRaw = requests.get(url).text
soup = BeautifulSoup(pageRaw , 'lxml')
print(soup.table)
In my code I use the lxml library to parse the data. If you want it, install it with pip install lxml, or substitute your own parser in this part of the code:
soup = BeautifulSoup(pageRaw , 'lxml')
This code returns the first table on the page.
Take care

Python scrape specific tag without class name

I'm developing a Python script to scrape data from a specific site.
I'm using Beautiful Soup as the Python module.
The data I'm interested in sits in this structure in the HTML page:
<tbody aria-live="polite" aria-relevant="all">
<tr style="">
<td>
<a href="www.server.com/art/crag">Name<a>
</td>
<td class="nowrap"></td>
<td class="hidden-xs"></td>
</tr>
</tbody>
The tbody contains several tr tags, and from each of them I would like to take only the first a tag inside its td tags.
I have tried it this way:
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
a = soup.find(id='tabella_falist')
b = a.find("tbody")
link = [p.attrs['href'] for p in b.select("a")]
but this way the script takes every href from every td tag. How can I take only the first one?
Thanks
If I understood correctly you can try this:
from bs4 import BeautifulSoup
import requests
url = 'your_url'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.a)
soup.a will return the first a tag on the page.
This should do the work
html = '''<html><body><tbody aria-live="polite" aria-relevant="all">
<tr style="">
<td>
<a href="www.server.com/art/crag">GOOD ONE<a>
<a href="www.server.com/art/crag">NOT GOOD ONE<a>
</td>
<td class="nowrap">
GOOD ONE
</td>
<td class="hidden-xs"></td>
</tr>
</tbody></body></html>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for td in soup.select('td'):
    a = td.find('a')  # find() returns only the first match, or None
    if a is not None:
        print(a.attrs['href'])
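If the goal is the first link of each row rather than of each cell, one possible sketch is to iterate over the tr tags instead. The two sample rows and their URLs are hypothetical; only the tabella_falist id and the tbody/tr/td/a nesting come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical rows modelled on the structure in the question.
html = """
<table id="tabella_falist">
<tbody>
<tr>
<td><a href="www.server.com/art/crag-1">First</a> <a href="www.server.com/other-1">Other</a></td>
<td class="nowrap"><a href="www.server.com/extra-1">Extra</a></td>
</tr>
<tr>
<td><a href="www.server.com/art/crag-2">Second</a></td>
<td class="nowrap"></td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.find(id="tabella_falist").find("tbody")

first_links = []
for row in tbody.find_all("tr"):
    # find() stops at the first matching tag in document order,
    # so later links in the same row are ignored.
    a = row.find("a", href=True)
    if a is not None:
        first_links.append(a["href"])

print(first_links)
```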

Parsing HTML with BeautifulSoup fails to find a table

I am trying to parse the data in this website:
http://www.baseball-reference.com/boxes/CHN/CHN201606020.shtml
I want to extract some of the data in the tables. But for some reason, I am struggling to find them. For example, what I want to do is this
from bs4 import BeautifulSoup
import requests
url = 'http://www.baseball-reference.com/boxes/CHN/CHN201606020.shtml'
soup = BeautifulSoup(requests.get(url).text)
soup.find('table', id='ChicagoCubsbatting')
The final line returns nothing despite a table with that id existing in the html. Furthermore, len(soup.findAll('table')) returns 1 even though there are many tables in the page. I've tried using the 'lxml', 'html.parser' and 'html5lib'. All behave the same way.
What is going on? Why does this not work and what can I do to extract the table?
The tables are inside HTML comments, so find() does not see them as tags. Use soup.find('div', class_='placeholder').next_sibling.next_sibling to get the comment text (text below), then build a new soup from that text.
In [35]: new_soup = BeautifulSoup(text, 'lxml')
In [36]: new_soup.table
Out[36]:
<table class="teams poptip" data-tip="San Francisco Giants at Atlanta Braves">
<tbody>
<tr class="winner">
<td>SFG</td>
<td class="right">6</td>
<td class="right gamelink">
Final
</td>
</tr>
<tr class="loser">
<td>ATL</td>
<td class="right">0</td>
<td class="right">
</td>
</tr>
</tbody>
</table>
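A more general sketch of the comment trick, not tied to the exact position of the placeholder div: collect every Comment node, parse each one as HTML, and look for the table there. The page below is a made-up miniature; only the ChicagoCubsbatting id is taken from the question.

```python
from bs4 import BeautifulSoup, Comment

# Made-up page: the table exists only inside an HTML comment.
html = """
<div class="placeholder"></div>
<!--
<table id="ChicagoCubsbatting">
<tr><td>Player</td><td>AB</td></tr>
</table>
-->
"""

soup = BeautifulSoup(html, "html.parser")

# Comment contents are kept as text, not parsed into tags.
print(soup.find("table", id="ChicagoCubsbatting"))

found = None
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    # Re-parse the comment body as its own document.
    inner = BeautifulSoup(comment, "html.parser")
    found = inner.find("table", id="ChicagoCubsbatting")
    if found is not None:
        break

print(found is not None)
```

Scanning all comments costs a little more than jumping to one sibling, but it does not break if the site inserts extra whitespace or nodes between the placeholder and the comment.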

Fetching the <td> element using Beautiful Soup

I am able to fetch the text of the td element with class list_selected using Beautiful Soup:
soup.find_all(class_ = {"list_selected"})
Now I need to fetch the NAME part that follows it. There are a number of similar blocks.
<tr>
<td align="left" style="padding-left: 3px;padding-right:3px;" class="list_selected">1422</td>
<td align="left" style="padding-left: 3px;padding-right:3px;" class="data">123456</td>
<td align="left" style="padding-left: 3px;padding-right:3px;" class="data">NAME</td>
</tr>
soup.find_all("td", { "class" : "list_selected" })
This will fetch the td nodes for you. The result is a list of nodes according to the documentation.
Beautiful Soup has an attribute called .text that returns the inner text of a tag.
I corrected your code below to get the inner text
from bs4 import BeautifulSoup
soup1 = BeautifulSoup('<td align="left" style="padding-left:3px;padding-right:3px;" class="list_selected">1422</td>',"lxml")
second = soup1.find("td", {"class": "list_selected"})  # find the td with class list_selected
name = second.text  # inner text of that td
print(name)  # displays the inner text
Hope you got everything correct :)
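Neither snippet above reaches the NAME cell itself. One way to get there, sketched under the assumption that NAME is always the second "data" cell after the selected one, is to walk the following siblings of each list_selected cell. The HTML is the row from the question with the style attributes dropped.

```python
from bs4 import BeautifulSoup

# The row from the question, without the inline styles.
html = """
<table>
<tr>
<td class="list_selected">1422</td>
<td class="data">123456</td>
<td class="data">NAME</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

names = []
for selected in soup.find_all("td", class_="list_selected"):
    # All later td.data cells in the same row, in document order.
    siblings = selected.find_next_siblings("td", class_="data")
    if len(siblings) >= 2:  # assumption: NAME is the second one
        names.append(siblings[1].get_text(strip=True))

print(names)
```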

Python Code to get the html data of a table present in the source page

I am new to Python and I am trying to scrape a website.
I am able to log in to the website and get an HTML page, but I don't need the whole page; I just need the hyperlinks in one specific table.
I have written the code below, but it returns all the hyperlinks on the page.
soup = BeautifulSoup(the_page)
for table in soup.findAll('table',{'id':'ctl00_Main_lvMyAccount_Table1'} ):
    for link in soup.findAll('a'):
        print(link.get('href'))
Can anyone tell me where I am going wrong?
Below is the html text of the table
<table id="ctl00_Main_lvMyAccount_Table1" width="680px">
<tr id="ctl00_Main_lvMyAccount_Tr1">
<td id="ctl00_Main_lvMyAccount_Td1">
<table id="ctl00_Main_lvMyAccount_itemPlaceholderContainer" border="1" cellspacing="0" cellpadding="3">
<tr id="ctl00_Main_lvMyAccount_Tr2" style="background-color:#0090dd;">
<th id="ctl00_Main_lvMyAccount_Th1"></th>
<th id="ctl00_Main_lvMyAccount_Th2">
<a id="ctl00_Main_lvMyAccount_SortByAcctNum" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctNum','')">
<font color=white>
<span id="ctl00_Main_lvMyAccount_AcctNum">Account number</span>
</font>
</a>
</th>
<th id="ctl00_Main_lvMyAccount_Th4">
<a id="ctl00_Main_lvMyAccount_SortByServAdd" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByServAdd','')">
<font color=white>
<span id="ctl00_Main_lvMyAccount_ServiceAddress">Service address</span>
</font>
</a>
</th>
<th id="ctl00_Main_lvMyAccount_Th5">
<a id="ctl00_Main_lvMyAccount_SortByAcctName" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctName','')">
<font color=white>
<span id="ctl00_Main_lvMyAccount_AcctName">Name</span>
</font>
</a>
</th>
<th id="ctl00_Main_lvMyAccount_Th6">
<a id="ctl00_Main_lvMyAccount_SortByStatus" href="javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByStatus','')">
<font color=white>
<span id="ctl00_Main_lvMyAccount_AcctStatus">Account status</span>
</font>
</a>
</th>
<th id="ctl00_Main_lvMyAccount_Th3"></th>
</tr>
<tr>
<td>
Thanks in advance.
Well, this is the right way to do it.
soup = BeautifulSoup(the_page)
for table in soup.findAll('table',{'id':'ctl00_Main_lvMyAccount_Table1'} ):
    for link in table.findAll('a'):  # search for links only in the table
        print(link['href'])  # get the href attribute
Also, you can skip the parent loop, since there would be only one match for the specified id:
soup = BeautifulSoup(the_page)
table = soup.find('table',{'id':'ctl00_Main_lvMyAccount_Table1'})
for link in table.findAll('a'):  # search for links only in the table
    print(link['href'])  # get the href attribute
Update: Noticed what @DSM said. Fixed a missing quote in the table assignment.
Make sure your for loop looks up in the table html (and not soup variable, which is the page html):
from bs4 import BeautifulSoup
page = BeautifulSoup(the_page)
table = page.find('table', {'id': 'ctl00_Main_lvMyAccount_Table1'})
links = table.findAll('a')
# Print href
for link in links:
    print(link['href'])
Result
In [8]: table = page.find('table', {'id' : 'ctl00_Main_lvMyAccount_Table1'})
In [9]: links = table.findAll('a')
In [10]: for link in links:
....: print link['href']
....:
javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctNum','')
javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByServAdd','')
javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByAcctName','')
javascript:__doPostBack('ctl00$Main$lvMyAccount$SortByStatus','')
Your nested loop for link in soup.findAll('a'): is searching the entire HTML page.
If you want to search for links within the table change that line to:
for link in table.findAll('a'):
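To make the difference concrete, here is a small self-contained sketch that runs both searches on a made-up page with one link outside the table and two inside it. The hrefs are invented; only the table id comes from the question.

```python
from bs4 import BeautifulSoup

# Made-up page: one link outside the table, two inside.
the_page = """
<a href="/outside">Outside the table</a>
<table id="ctl00_Main_lvMyAccount_Table1">
<tr><th><a href="/sort-by-number">Account number</a></th></tr>
<tr><th><a href="/sort-by-name">Name</a></th></tr>
</table>
"""

soup = BeautifulSoup(the_page, "html.parser")
table = soup.find("table", id="ctl00_Main_lvMyAccount_Table1")

# Searching from soup walks the whole document.
all_links = [a["href"] for a in soup.find_all("a", href=True)]

# Searching from table walks only the table's descendants.
table_links = [a["href"] for a in table.find_all("a", href=True)] if table else []

print(all_links)
print(table_links)
```

The `if table` guard matters in practice: find() returns None when the id is missing, for example when the login failed and a different page came back.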
