I'm having trouble using BeautifulSoup4.
I'm trying to find every link under a specific div on an HTML page (https://www.gov.br/pt-br/servicos/infrabr). This is my code:
for a in soup.find(class_='col-servico').find_all(href=True):
    print(a['href'])
This is the HTML block I'm interested in (it's inside div class=col-servico):
<div class="servico_fieldset_outras_conteudo">
<p style="text-align: justify; ">Agora, o Sistema Eletrônico do Serviço de Informação (e-SIC) está integrado
ao <span><a href="http://fala.BR">Fala.BR</a></span>.
Desenvolvida .</p>
<p style="text-align: justify; ">Em conformidade com
a
<span><a class="external-link"
href="http://www.planalto.gov.br/ccivil_03/_ato2011-2014/2011/lei/l12527.htm" target="_blank" title=""
data-tippreview-enabled="false" data-tippreview-image="" data-tippreview-title="">
As you can see this block has two 'a' tags:
<span><a href="http://fala.BR">Fala.BR</a></span>
<span><a class="external-link"
href="http://www.planalto.gov.br/ccivil_03/_ato2011-2014/2011/lei/l12527.htm" target="_blank" title=""
data-tippreview-enabled="false" data-tippreview-image="" data-tippreview-title="">
But my result contains only one:
http://www.planalto.gov.br/ccivil_03/_ato2011-2014/2011/lei/l12527.htm
As you can see, it's only returning the 'planalto.gov.br' URL, and not the 'fala.BR' one. I can't see any difference between those two 'a' tags and I don't know why it isn't returning the first link. Could anyone help me, please?
There are a couple of issues, I think. First, on the page you linked, there is no href for Fala.BR in the given div. You can inspect it manually:
import httpx
from bs4 import BeautifulSoup

res = httpx.get("https://www.gov.br/pt-br/servicos/infrabr")
soup = BeautifulSoup(res.text, "html.parser")
print(soup.find(class_="col-servico").prettify())
If we instead use the snippet you posted, we do get the fala BR href:
html = """
<div class="servico_fieldset_outras_conteudo">
<p style="text-align: justify; ">Agora, o Sistema Eletrônico do Serviço de Informação (e-SIC) está integrado
ao <span><a href="http://fala.BR">Fala.BR</a></span>.
Desenvolvida .</p>
<p style="text-align: justify; ">Em conformidade com
a
<span><a class="external-link"
href="http://www.planalto.gov.br/ccivil_03/_ato2011-2014/2011/lei/l12527.htm" target="_blank" title=""
data-tippreview-enabled="false" data-tippreview-image="" data-tippreview-title="">
"""
soup = BeautifulSoup(html, "html.parser")
for a in soup.find(class_="servico_fieldset_outras_conteudo").find_all(href=True):
    print(a["href"])
# Returns:
# http://fala.BR
# http://www.planalto.gov.br/ccivil_03/_ato2011-2014/2011/lei/l12527.htm
Maybe just check that you're using the correct HTML in your processing.
The 'col-servico' element also contains nodes that are not tags, such as comments, so when you call find_all on each of them you get an error.
Here is the code:
from bs4 import BeautifulSoup, Tag
[...]
for div in soup.find(class_='col-servico'):
    if isinstance(div, Tag):
        for tag in div.find_all(href=True):
            print(tag['href'])
Output:
#
#
#
#
http://www.planalto.gov.br/ccivil_03/_ato2011-2014/2011/lei/l12527.htm
http://www.planalto.gov.br/ccivil_03/_ato2015-2018/2017/lei/l13460.htm
https://www.gov.br/pt-br/orgaos/ministerio-da-infraestrutura
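To see why the isinstance check matters, here is a minimal sketch on a made-up HTML fragment (the markup and the example.org URLs are only for illustration, not taken from the real page). Iterating over a Tag yields its direct children, and those include text and comment nodes as well as tags:

```python
from bs4 import BeautifulSoup, Comment, Tag

# Made-up HTML, only for illustration.
html = """
<div class="col-servico">
  <!-- a comment node -->
  plain text
  <p><a href="https://example.org/a">A</a></p>
  <p><a href="https://example.org/b">B</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find(class_="col-servico")

# Iterating over a Tag yields its direct children: tags, text and comments.
for child in container:
    print(type(child).__name__, isinstance(child, Tag))

# Keep only real tags before calling find_all on each child.
links = [
    a["href"]
    for child in container
    if isinstance(child, Tag)
    for a in child.find_all(href=True)
]
print(links)
```

Calling find_all directly on the container searches all of its descendants and only ever returns tags, so it should give the same links here without the explicit check.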
For example, I have a big chunk of text with HTML tags in it, and I want a function that removes the HTML tags from it.
But I want to delete only the tags, not the text.
The problem is more complicated than it seems: if there are ol and ul tags and I want to remove ol, the text must be kept and the li tags inside that ol must be removed too,
but the li tags inside ul must stay.
I have tried BeautifulSoup and some NLP techniques, but with no success:
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
html_know='''<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSGTVf63Vm3XgOncMVSOy0-jSxdMT8KVJIc8WiWaevuWiPGe0Pm" class="image_master" alt="" style="width: 248px; height: 164px; vertical-align: middle;">
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSGTVf63Vm3XgOncMVSOy0-jSxdMT8KVJIc8WiWaevuWiPGe0Pm" style="width: 250px; height: 166px;">
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSGTVf63Vm3XgOncMVSOy0-jSxdMT8KVJIc8WiWaevuWiPGe0Pm" style="width: 249px; height: 165px;">
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSGTVf63Vm3XgOncMVSOy0-jSxdMT8KVJIc8WiWaevuWiPGe0Pm" style="width: 249px; height: 165px;">
<p></p>
<p><strong><span style="font-family: Impact, Charcoal, sans-serif; font-size: 36px;">HTML</span></strong></p> <span style="background-color: rgb(255, 255, 0);"><p></p>HTML stands for Hyper Text Markup Language, which is the most widely used language on Web to develop web pages. HTML was created by Berners-Lee in late 1991 but "HTML 2.0" was the first standard HTML specification which was published in 1995. HTML 4.01 was a major version of HTML and it was published in late 1999.</span>Though
HTML 4.01 version is widely used but currently we are having HTML-5 version which is an extension to HTML 4.01, and this version was published in 2012. Audience This tutorial is designed for the aspiring Web Designers and Developers with a need to understand
the HTML in enough detail along with its simple overview, and practical examples. This tutorial will give you enough ingredients to start with HTML from where you can take yourself at higher level of expertise.
<p></p>
<p>
</p>
<p></p>HTML stands for Hypertext Markup Language, and it is the most widely used language to write Web Pages. Hypertext refers to the way in which Web pages (HTML documents) are linked together. Thus, the link available on a webpage is called Hypertext.
As its name suggests, HTML is a Markup Language which means you use HTML to simply "mark-up" a text document with tags that tell a Web browser how to structure it to display. Originally, HTML was developed with the intent of defining the structure of
documents like headings, paragraphs, lists, and so forth to facilitate the sharing of scientific information between researchers. Now, HTML is being widely used to format web pages with the help of different tags available in HTML language. Basic HTML
Document In its simplest form, following is an example of an HTML document
<p></p>
<p><img src="" style="width: 119px; height: 119px;"></p>
<table style="width:100%">
<tbody>
<tr>
<th style="border-color: rgb(0, 0, 0);">Firstname</th>
<th style="border-color: rgb(0, 0, 0);">Lastname</th>
<th style="border-color: rgb(0, 0, 0);">Age</th>
</tr>
<tr>
<td style="border-color: rgb(0, 0, 0);">Jill</td>
<td style="border-color: rgb(0, 0, 0);">Smith</td>
<td style="border-color: rgb(0, 0, 0);">50</td>
</tr>
<tr>
<td style="border-color: rgb(0, 0, 0);">Eve</td>
<td style="border-color: rgb(0, 0, 0);">Jackson</td>
<td style="border-color: rgb(0, 0, 0);">94</td>
</tr>
</tbody>
</table>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>HTML Tags As told earlier, HTML is a markup language and makes use of various tags to format the content. These tags are enclosed within angle braces Except few tags, most of the tags have their corresponding closing tags. For example, has its closing
tag and tag has its closing tag tag etc. Above example of HTML document uses the following tags ? Sr.No Tag & Description 1 This tag defines the document type and HTML version. 2 This tag encloses the complete HTML document and mainly comprises of
document header which is represented by ... and document body which is represented by ... tags. 3 This tag represents the document's header which can keep other HTML tags like html,head,body,title,...etc
<ol>
<li>2</li>
<li>2</li>
<li>3</li>
</ol>
<ul>
<li>sdfsdf</li>
<li>s</li>
<li>dfsd</li>
<li>f</li>
<li>sd</li>
<li>f</li>
<li>sd</li>
</ul>
<p></p>
<p><iframe width="1019px" height="311px" src="//www.youtube.com/embed/uCg2BoKiuOM" frameborder="0" allowfullscreen=""></iframe></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>'''
soup=BeautifulSoup(html_know, 'html.parser')
tags=soup.find_all('table')
print(tags[0].text)
print(html_know[3])
The idea is that sometimes I want to remove some tags and at other times different ones.
Please, can you give me some ideas for doing this without hard-coding everything?
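One possible approach, as a rough sketch rather than a finished solution, is BeautifulSoup's unwrap(), which removes a tag but keeps whatever was inside it. The function name strip_tags and its also_strip_children parameter are made up for this example; the mapping says which child tags to drop only when they sit directly inside a given parent:

```python
from bs4 import BeautifulSoup


def strip_tags(html, tags_to_strip, also_strip_children=None):
    """Remove the given tags but keep their text (hypothetical helper).

    also_strip_children maps a parent tag name to child tag names that are
    unwrapped only when they are direct children of that parent.
    """
    also_strip_children = also_strip_children or {}
    soup = BeautifulSoup(html, "html.parser")
    for name in tags_to_strip:
        for tag in soup.find_all(name):
            for child_name in also_strip_children.get(name, []):
                # recursive=False: only direct children, so <li> in a <ul> stays
                for child in tag.find_all(child_name, recursive=False):
                    child.unwrap()
            tag.unwrap()
    return str(soup)


sample = "<ol><li>1</li><li>2</li></ol><ul><li>a</li></ul><p><b>x</b></p>"
result = strip_tags(sample, ["ol", "p"], {"ol": ["li"]})
print(result)
```

Because the tags to remove are passed in as arguments, the same function can be called with a different list each time instead of hard-coding the tag names.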
I have a website that has plenty of hidden tags in its HTML.
I have pasted the source code below.
The challenge is that there are two types of hidden tags:
1. Ones with style="display:none".
2. Ones hidden through a list of styles declared inside each td tag.
This list changes with every td tag.
For the example below, it has the following styles:
.hLcj{display:none}
.J9pE{display:inline}
.kUC-{display:none}
.Dzkb{display:inline}
.mXJU{display:none}
.DZqk{display:inline}
.rr9s{display:none}
.nGF_{display:inline}
So the elements with class hLcj, kUC-, mXJU, rr9s, etc. are hidden elements.
I want to extract the text of the entire tr but exclude these hidden tags.
I have been scratching my head for hours, still with no success.
Any help would be much appreciated. Thanks.
I am using bs4 and Python 2.7.
<td class="leftborder timestamp" rel="1416853322">
<td>
<span>
<style>
.hLcj{display:none}
.J9pE{display:inline}
.kUC-{display:none}
.Dzkb{display:inline}
.mXJU{display:none}
.DZqk{display:inline}
.rr9s{display:none}
.nGF_{display:inline}
</style>
<span class="rr9s">35</span>
<span></span>
<div style="display:none">121</div>
<span class="226">199</span>
.
<span class="rr9s">116</span>
<div style="display:none">116</div>
<span></span>
<span class="Dzkb">200</span>
<span style="display: inline">.</span>
<span style="display:none">86</span>
<span class="kUC-">86</span>
<span></span>
120
<span class="kUC-">134</span>
<div style="display:none">134</div>
<span class="mXJU">151</span>
<div style="display:none">151</div>
<span class="rr9s">154</span>
<span class="Dzkb">.</span>
<span class="119">36</span>
<span class="kUC-">157</span>
<div style="display:none">157</div>
<span class="rr9s">249</span>
<div style="display:none">249</div>
</span>
</td>
<td> 7808</td>
Using Selenium would make the task much easier, since it knows which elements are hidden and which aren't.
But anyway, here's some basic code that you would probably need to improve further. The idea is to parse the style tag to get the list of classes to exclude, keep a list of tags to exclude, and check the style attribute of the element around each piece of text in tr:
import re
from bs4 import BeautifulSoup

data = """ your html here """
soup = BeautifulSoup(data, "html.parser")
tr = soup.tr

# get classes to exclude
classes_to_exclude = []
for line in tr.style.text.split():
    match = re.match(r'^\.(.*?)\{display:none\}', line)
    if match:
        classes_to_exclude.append(match.group(1))

tags_to_exclude = ['style', 'script']
texts = []
for item in tr.find_all(text=True):
    # skip the contents of <style> and <script>
    if item.parent.name in tags_to_exclude:
        continue
    # skip elements hidden through a class
    class_ = item.parent.get('class')
    if class_ and class_[0] in classes_to_exclude:
        continue
    # skip elements hidden through an inline style
    if item.parent.get('style') == 'display:none':
        continue
    texts.append(item)

# strip each piece; texts is a list, so it has no strip() of its own
print(''.join(text.strip() for text in texts))
Prints:
199.200.120.36
Also see:
BeautifulSoup Grab Visible Webpage Text
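As one possible improvement, here is a small self-contained sketch of the same idea that also tolerates spaces and a trailing semicolon in the inline style (for example "display: none;"). The HTML is a shortened, made-up version of the snippet from the question, and the helper name is_hidden is hypothetical:

```python
import re
from bs4 import BeautifulSoup

# Shortened, made-up sample in the same shape as the question's markup.
html = """<table><tr><td><span>
<style>
.rr9s{display:none}
.Dzkb{display:inline}
</style>
<span class="rr9s">35</span>
<div style="display: none;">121</div>
<span class="226">199</span>
<span class="Dzkb">.</span>
<span style="display: inline">200</span>
</span></td></tr></table>"""


def is_hidden(tag, hidden_classes):
    """True if the tag is hidden by an inline style or by one of its classes."""
    style = re.sub(r"\s+", "", tag.get("style", "")).lower()
    if "display:none" in style:
        return True
    return any(c in hidden_classes for c in tag.get("class", []))


soup = BeautifulSoup(html, "html.parser")
tr = soup.tr

# Class names whose rule in the <style> block is display:none.
css = tr.style.string or ""
hidden_classes = set(re.findall(r"\.([^\s{]+)\{display:none\}", css))

parts = []
for text in tr.find_all(string=True):
    parent = text.parent
    if parent.name in ("style", "script") or is_hidden(parent, hidden_classes):
        continue
    parts.append(text.strip())

visible = "".join(parts)
print(visible)
```

Like the code above, this only looks at the direct parent of each piece of text, so text nested deeper inside a hidden element would still slip through.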
I am relatively new to Python and Scrapy. I'm trying to scrape the links in "Customers who bought this item also bought".
For example: http://www.amazon.com/Confessions-Economic-Hit-John-Perkins-ebook/dp/B001AFF266/. There are 17 pages for "Customers who bought this item also bought". If I ask Scrapy to scrape that URL, it only scrapes the first page (6 items). How do I ask Scrapy to press the "Next" button to scrape all the items in the 17 pages? Sample code (just the part that matters in crawler.py) would be greatly appreciated. Thank you for your time!
OK, here is my code. As I said, I am new to Python, so the code might look quite stupid, but it works for scraping the first page (6 items). I work mostly with Fortran or Matlab. I would love to learn Python systematically if I have time, though.
# Code of my crawler.py:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from beta.items import BetaItem

class AlphaSpider(CrawlSpider):
    name = 'alpha'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/s/ref=lp_4366_nr_p_n_publication_date_0?rh=n%3A283155%2Cn%3A%211000%2Cn%3A4366%2Cp_n_publication_date%3A1250226011&bbn=4366&ie=UTF8&qid=1384729756&rnid=1250225011']
    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//h3/a',)), callback='parse_item'), )

    def parse_item(self, response):
        sel = Selector(response)
        stuff = BetaItem()
        isbn10R = sel.xpath('//li[b[contains(text(),"ISBN-10:")]]/text()').extract()
        isbn10 = []
        if len(isbn10R) > 0:
            isbn10 = [(isbn10R[0].split(' '))[1]]
        stuff['isbn10'] = isbn10
        starsR = sel.xpath('//div[contains(@id,"averageCustomerReviews")]/span/@title').extract()
        stars = []
        if len(starsR) > 0:
            stars = [(starsR[0].split(' '))[0]]
        stuff['stars'] = stars
        reviewsR = sel.xpath('//div[contains(@id,"averageCustomerReviews")]/a[contains(@href,"showViewpoints=1")]/text()').extract()
        reviews = []
        if len(reviewsR) > 0:
            reviews = [(reviewsR[0].split(' '))[0]]
        stuff['reviews'] = reviews
        copsR = sel.xpath('//a[@class="sim-img-title"]/@href').extract()
        ncops = len(copsR)
        cops = [None] * ncops
        if ncops > 0:
            for idx, cop in enumerate(copsR):
                cops[idx] = ((cop.split('dp/'))[1].split('/ref'))[0]
        stuff['cops'] = cops
        return stuff
So I understand you were able to scrape these "Customers Who Bought This Item Also Bought" product details. As you probably saw, these are within a ul in a div with class "shoveler-content":
<div id="purchaseButtonWrapper" class="shoveler-button-wrapper">
<a class="back-button" onclick="return false;" style="" href="#Back">
<div class="shoveler-content">
<ul tabindex="-1">
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">
<div id="purchase_B003LSTK8G" class="new-faceout p13nimp" data-ref="pd_sim_kstore_1" data-asin="B003LSTK8G">
...
</div>
</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
</ul>
</div>
<a class="next-button" onclick="return false;" style="" href="#Next">
<span class="auiTestSprite s_shvlNext">...</span>
</a>
</div>
</div>
When you inspect network activity in your browser of choice (via Firebug or the Chrome inspect tool) and click the "next" button for the next suggested products, you'll see an AJAX query to this sort of URL:
http://www.amazon.com
/gp/product/features/similarities/shoveler/cell-render.html/ref=pd_sim_kstore?
id=B00261OOWQ,B003XQEVUI,B001NLL5WC,B000FC1KZC,B005G5PPGS,B0043RSJB8,
B004TSBWYC,B000RH0C8G,B0035IID08,B002AQRVXQ,B005DIAUN6,B000FC10QG
&pos=7&refTag=pd_sim_kstore&wdg=ebooks_display_on_website
&shovelerName=purchase
(I'm using this product page: http://www.amazon.com/Boomerang-Travels-New-Third-World-ebook/dp/B005CRQ2OE)
The id query argument holds a list of ASINs, which are the next suggested products. Why 12 ASINs when only 6 are displayed? Probably some in-page caching for the next "next" click the user is likely to make.
What do you get back from this AJAX query? Still within your browser's inspect tool, you'll see that the response is of type application/json, and the response data is a JSON array of 12 elements, each element being an HTML snippet similar to:
<div class="new-faceout p13nimp" id="purchase_B00261OOWQ" data-asin="B00261OOWQ" data-ref="pd_sim_kstore_7">
<a href="/Home-Game-Accidental-Guide-Fatherhood-ebook/dp/B00261OOWQ/ref=pd_sim_kstore_7" class="sim-img-title" >
<div class="product-image">
<img src="http://ecx.images-amazon.com/images/I/51ZBpvGgsUL._SL500_PIsitb-sticker-arrow-big,TopRight,35,-73_OU01_SS100_.jpg" width="100" alt="" height="100" border="0" />
</div> Home Game: An Accidental Guide to Fatherhood
</a>
<div class="byline">
<span class="carat">›</span>
Michael Lewis
</div>
<div class="rating-price">
<span class="rating-stars">
<span class="crAvgStars" style="white-space:no-wrap;">
<span class="asinReviewsSummary" name="B00261OOWQ">
<a href="http://www.amazon.com/Home-Game-Accidental-Guide-Fatherhood-ebook/product-reviews/B00261OOWQ/ref=pd_sim_kstore_cm_cr_acr_img_7">
<span class="auiTestSprite s_star_4_0 " title="4.1 out of 5 stars" >
<span>4.1 out of 5 stars</span>
</span>
</a>
</span>
(99)
</span>
</span>
</div>
<div class="binding-platform"> Kindle Edition </div>
<div class="pricetext"><span class="price" style="margin-right:5px">$11.36</span></div>
</div>
So you basically get what was in the original page section for suggested products earlier, in each <li> from <div class="shoveler-content"><ul>.
But how do you get those ASIN codes to put in the AJAX query's id parameter?
Well, in the product page, you'll notice this section:
<div id="purchaseSimsData"
class="sims-data" style="display:none"
data-baseAsin="B005CRQ2OE" data-featureId="pd_sim"
data-pageId="B005CRQ2OEr_sim_2" data-reftag="pd_sim_kstore"
data-wdg="ebooks_display_on_website" data-widgetName="purchase">
B003LSTK8G,B000VKVZR6,B003E20ZRY,B000RH0C9A,B000RH0CA4,B000YMDQRS,
B00261OOWQ,B003XQEVUI,B001NLL5WC,B000FC1KZC,B005G5PPGS,B0043RSJB8,
B004TSBWYC,B000RH0C8G,B0035IID08,B002AQRVXQ,B005DIAUN6,B000FC10QG,
B0018QQQKS,B002OTKEP6,B005PUWUKS,B007V65R54,B00B3VOTTI,B004EYT932,
B002UBRFFU,B000WJSB50,B000RH0DYE,B004JXXKWY,B003E8AJXI,B008TRU7PE,
B00555X8OA,B007OSIOWM,B00DLJIA54,B00139XTG4,B0058Z4NR8,B00ALBR6JG,
B004H0M8QS,B003F3PL7Q,B008UX8YPC,B000U913GG,B003HOXLVQ,B000VWM0MI,
B000SEIU28,B006VE7YS0,B008KPMBIG,B003CIQ57E,B0064EHZY0,B008UX3ITE,
B001NLKY38,B003VIWK4C,B005GSYZRA,B007YGGOVM,B004H4X84K,B00B5ZQ72Y,
B000R1BAH4,B008W02TIG,B000W8HC8I,B0036QVOKU,B000VRBBDC,B00APDGFOC,
B00EOAS0EK,B000QCS888,B001QIGZEK,B0074B55IK,B000FC12C8,B00AP2XVJ0,
B000FCK5YE,B006ID6UAW,B001FA0W5W,B005HFI0X2,B006ZOYM9K,B003SNJZ3Y,
B00C1N5WOI,B008EKORIY,B00C4GRK4W,B004V3WRNU,B00BV6RTUG,B001AFF266,
B00DUM1W3E,B00APDGGCS,B008WOUFIS,B008EKOO46,B008JHXO6S,B005AJM3U6,
B00BKRW6GI,B00CDUVSQ0,B00A287PG2,B009H679WA,B000VDUWMC,B009NF6IRW
</div>
which looks like the ASINs of all the suggested products.
Therefore, I suggest you emulate successive AJAX queries to fetch the suggested products, 12 ASINs at a time, decode each response using the json package, and then parse each HTML snippet to extract the product info you want.
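The steps above could be sketched roughly as follows, using only the standard library and no network access. The endpoint and fixed parameters are copied from the request observed above and may well change. The function names are made up. Treating pos as the 1-based index of the first ASIN in each batch is an assumption, based on the single request shown (pos=7 for a batch starting at the 7th ASIN):

```python
import json
import re
from urllib.parse import urlencode

# Endpoint as seen in the observed AJAX request; an assumption that may change.
BASE = ("http://www.amazon.com/gp/product/features/similarities/"
        "shoveler/cell-render.html/ref=pd_sim_kstore")


def parse_asin_list(text):
    """Split the comma-separated text of the purchaseSimsData div into ASINs."""
    return [asin.strip() for asin in text.split(",") if asin.strip()]


def batches(asins, shown=6, size=12):
    """Yield (pos, chunk) pairs, skipping the `shown` items already on the page.

    pos is assumed to be the 1-based index of the first ASIN in the chunk.
    """
    for start in range(shown, len(asins), size):
        yield start + 1, asins[start:start + size]


def shoveler_url(asins, pos):
    """Build the AJAX URL for one batch of ASINs."""
    query = urlencode({
        "id": ",".join(asins),
        "pos": pos,
        "refTag": "pd_sim_kstore",
        "wdg": "ebooks_display_on_website",
        "shovelerName": "purchase",
    }, safe=",")
    return BASE + "?" + query


def asins_in_response(body):
    """Decode the JSON array of HTML snippets and pull out each data-asin."""
    found = []
    for snippet in json.loads(body):
        match = re.search(r'data-asin="([^"]+)"', snippet)
        if match:
            found.append(match.group(1))
    return found


# Placeholder ASINs, only to show how the batches are formed.
all_asins = parse_asin_list(",\n".join("ASIN%02d" % i for i in range(20)))
urls = [shoveler_url(chunk, pos) for pos, chunk in batches(all_asins)]
for url in urls:
    print(url)
```

In a spider, each of these URLs would be requested in turn and the response body passed to something like asins_in_response, or to a real HTML parser if you need more than the ASIN from each snippet.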
I would recommend avoiding Scrapy, especially since you're a beginner.
Use the awesome Requests module for downloading pages:
https://github.com/kennethreitz/requests
and BeautifulSoup for parsing web pages:
http://www.crummy.com/software/BeautifulSoup/
I am trying to scrape the URL http://www.kat.ph/search/beatles/?categories[]=music using BeautifulSoup:
torrents = bs.findAll('tr',id = re.compile('torrent_*'))
torrents gets all the torrents on that page; every element of torrents is a tr element.
My problem is that len(torrents[0].td) is 5, but I am not able to iterate over the td's. I mean, something like for x in torrents[0].td is not working.
The data that I am getting for torrents[0] is:
<tr class="odd" id="torrent_2962816">
<td class="fontSize12px torrentnameCell">
<div class="iaconbox floatedRight">
<a title="Torrent magnet link" href="magnet:?xt=urn:btih:0898a4b562c1098eb69b9b801c61a51d788df0f5&dn=the+beatles+2009+greatest+hits+cdrip+ikmn+reupld&tr=http%3A%2F%2Ftracker.publicbt.com%2Fannounce" onclick="_gaq.push(['_trackEvent', 'Download', 'Magnet Link', 'Music']);" class="imagnet icon16"></a>
<a title="Download torrent file" href="http://torrage.com/torrent/0898A4B562C1098EB69B9B801C61A51D788DF0F5.torrent?title=[kat.ph]the.beatles.2009.greatest.hits.cdrip.ikmn.reupld" onclick="_gaq.push(['_trackEvent', 'Download', 'Download torrent file', 'Music']);" class="idownload icon16"></a>
<a class="iPartner2 icon16" href="http://www.downloadweb.org/checking.php?acode=b146a357c57fddd450f6b5c446108672&r=d&qb=VGhlIEJlYXRsZXMgWzIwMDldIEdyZWF0ZXN0IEhpdHMgQ0RSaXAtIGlLTU4gUmVVUGxk" onclick="_gaq.push(['_trackEvent', 'Download', 'Download movie']);"></a>
<a class="iverif icon16" href="/the-beatles-2009-greatest-hits-cdrip-ikmn-reupld-t2962816.html" title="Verified Torrent"></a> <a rel="2962816,0" class="icomment" href="/the-beatles-2009-greatest-hits-cdrip-ikmn-reupld-t2962816.html#comments_tab">
<span class="icommentdiv"></span>145
</a>
</div>
<div class="torrentname">
The <strong class="red">Beatles</strong> [2009] Greatest Hits CDRip- iKMN ReUPld
<span>
Posted by <a class="plain" href="/user/iKMN/">iKMN</a>
<img src="http://static.kat.ph/images/verifup.png" alt="verified" /> in
<span id="cat_2962816">
Music
</span></span>
</div>
</td>
<td class="nobr">168.26 <span>MB</span></td>
<td>42</td>
<td>1 year</td>
<td class="green">1368</td>
<td class="red lasttd">94</td>
</tr>
I'd recommend using lxml instead of BeautifulSoup. Among other great features, you can use XPath to grab your links:
import lxml.html
doc = lxml.html.parse('http://www.kat.ph/search/beatles/?categories[]=music')
links = doc.xpath('//a[contains(@class,"idownload")]/@href')
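If you would rather stay with BeautifulSoup, here is a minimal sketch on a shortened, made-up row shaped like the one in the question. The point is that row.td is only the first td in the row, and len() on a tag counts that tag's child nodes, which is probably where the 5 came from. find_all('td') returns every cell:

```python
from bs4 import BeautifulSoup

# Shortened, made-up row in the same shape as the one in the question.
html = """
<table><tr class="odd" id="torrent_1">
<td class="torrentnameCell"><div>links</div><div>name</div></td>
<td class="nobr">168.26 <span>MB</span></td>
<td>42</td>
<td>1 year</td>
<td class="green">1368</td>
<td class="red lasttd">94</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr", id="torrent_1")

# row.td is only the first <td>; len() counts that cell's child nodes.
print(len(row.td))

# find_all returns every <td> in the row, which can be iterated normally.
cells = [td.get_text(" ", strip=True) for td in row.find_all("td")]
print(cells)
```

With the real page, the same loop would run over torrents[0].find_all('td').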