I'm trying to add several tags to div[id='head'] at once in BeautifulSoup with:
soup.find_all('div', id='head',limit=1)[0].insert(1, soup.new_tag(u'<div id="menu_top_right" class="menu_top"><div class="menu_inner"><a class="" target="_blank" href="./local/zkratky/index.html">Zkratky</a><span>|</span><a class="" target="_blank" href="./local/slovnik/index.html">Slovník</a><span>|</span><a class="" target="blank" href="./local/dokumenty/index.html">Dokumenty</a></div></div><div class="clear"></div>'))
As a result, the markup gets inserted twice (with some extra < and > characters), and I have no idea why:
<<div id="menu_top_right" class="menu_top">
<div class="menu_inner">
<a class="" target="_blank" href="./local/zkratky/index.html">Zkratky</a><span>|</span>
<a class="" target="_blank" href="./local/slovnik/index.html">Slovník</a><span>|</span>
<a class="" target="blank" href="./local/dokumenty/index.html">Dokumenty</a>
</div>
</div>
<div class="clear"></div>>
</<div id="menu_top_right" class="menu_top">
<div class="menu_inner">
<a class="" target="_blank" href="./local/zkratky/index.html">Zkratky</a><span>|</span>
<a class="" target="_blank" href="./local/slovnik/index.html">Slovník</a><span>|</span>
<a class="" target="blank" href="./local/dokumenty/index.html">Dokumenty</a>
</div>
</div>
<div class="clear"></div>>
I didn't find anything in the docs saying you can't create several new tags with one soup.new_tag(). What might be the problem?
I assume that you are using BeautifulSoup4? If so, the problem is that you cannot create multiple tags with one new_tag() call, let alone insert raw HTML with it. The argument is treated as the tag name, so the whole string ends up between the angle brackets:
>>> soup.new_tag('<div myattr="foo"></div>')
<<div myattr="foo"></div>></<div myattr="foo"></div>>
You must create each child separately and assign the attributes manually:
>>> parent = soup.find('div')
>>> parent
<div></div>
>>> new_tag = soup.new_tag('div')
>>> new_tag['id'] = 'menu_top_right'
>>> new_tag['class'] = 'menu_top'
>>> new_tag
<div class="menu_top" id="menu_top_right"></div>
>>> parent.insert(1, new_tag)
>>> soup
<div><div class="menu_top" id="menu_top_right"></div></div>
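If building every tag by hand is too tedious, one alternative (not part of the answer above, just a sketch) is to parse the HTML string into its own BeautifulSoup object and move the parsed nodes into the target. The starting document and the shortened menu markup below are made up for illustration, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical starting document; the real page will differ.
soup = BeautifulSoup('<div id="head"><h1>Title</h1></div>', "html.parser")
head = soup.find("div", id="head")

# Shortened version of the menu markup from the question.
menu_html = (
    '<div id="menu_top_right" class="menu_top">'
    '<div class="menu_inner">'
    '<a href="./local/zkratky/index.html">Zkratky</a>'
    '</div></div>'
    '<div class="clear"></div>'
)

# Parse the string separately, then move each top-level node into place.
fragment = BeautifulSoup(menu_html, "html.parser")

# Iterate over a copy: inserting a node detaches it from `fragment`.
for offset, node in enumerate(list(fragment.contents)):
    head.insert(1 + offset, node)

print(head)
```

Each top-level node of the parsed fragment ends up as a real child of div#head, so nothing is escaped or duplicated.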
I've got some html where a bit of it looks like
<div class="large-12 columns">
<div class="box">
<div class="header">
<h2>
Line-Ups
</h2>
</div>
<div class="large-6 columns aufstellung-box" style="padding: 0px;">
<div class="unterueberschrift aufstellung-unterueberschrift-mannschaft">
<div>
<a class="vereinprofil_tooltip" href="/fc-portsmouth/startseite/verein/1020/saison_id/2006" id="1020">
<img alt="Portsmouth FC" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/1020_1564722280.png?lm=1564722280" title=" "/>
...........
<div class="large-6 columns" style="padding: 0px;">
<div class="unterueberschrift aufstellung-unterueberschrift-mannschaft aufstellung-bordertop-small">
<div>
<a class="vereinprofil_tooltip" href="/fc-arsenal/startseite/verein/11/saison_id/2006" id="11">
<img alt="Arsenal FC" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/11_1400911988.png?lm=1400911994" title=" "/>
</a>
</div>
<div>
<nobr>
<a class="sb-vereinslink" href="/fc-arsenal/startseite/verein/11/saison_id/2006" id="11">
Arsenal FC
The key here is that <div class="large-6 shows up twice, which is what I'm trying to split on.
The code I'm using is simply boxes = soup.find_All("div",{'class',re.compile(r'^large-6 columns')}), but that returns absolutely nothing.
I've used BeautifulSoup successfully plenty of times before and I'm sure it's something stupid that I'm missing, but I've been banging my head against a wall for the last 2 hours and can't seem to figure it out. Any help would be much appreciated.
Python is case-sensitive, so I think the call has to be soup.find_all rather than soup.find_All. The code below ran against a working URL.
url = "https://####.###"
import requests
from bs4 import BeautifulSoup
r = requests.get(url)
coverpage = r.content
soup = BeautifulSoup(coverpage, 'html5lib')
test = soup.find_all("a")
print(test)
When I changed find_all to find_All, it broke with the following error:
test = soup.find_All("a")
TypeError: 'NoneType' object is not callable
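For what it's worth, the error seems to come from the way Beautiful Soup treats unknown attribute names: soup.find_All is looked up as "find the first <find_All> tag", which gives None, and calling None raises the TypeError. Also note that {'class', re.compile(...)} in the question uses a comma, so it is a set rather than a dict. Below is a minimal sketch with made-up, shortened HTML; it assumes that matching on the single class large-6 is enough for this page.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the page in the question.
html = (
    '<div class="large-12 columns">'
    '<div class="large-6 columns aufstellung-box">Portsmouth FC</div>'
    '<div class="large-6 columns">Arsenal FC</div>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Attribute access with an unknown name is a tag lookup, not a method.
# There is no <find_All> tag, so this is None (and None is not callable).
not_a_method = soup.find_All

# Matching a single CSS class finds both boxes, whatever other classes they carry.
boxes = soup.find_all("div", class_="large-6")
names = [box.get_text(strip=True) for box in boxes]
print(names)
```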
I know, from the title this question looks the same as a thousand others, but I have searched all the related and similar questions. What I'm asking is, given this HTML (just an example):
<html>
<body>
<div class="div-share noprint">
<div class="addthis_toolbox addthis_default_style">
<a class="btn btn-xs btn-share addthis_button_facebook" href="https://somelink" target="_blank">
<span class="playblk"><img alt="someimg" class="playblk" height="25" src="some source" title="sometitle" width="25"/></span>
</a>
<a class="btn btn-xs btn-share addthis_button_facebook" href="https://somelink" target="_blank">
<span class="playblk"><img alt="someimg" class="playblk" height="25" src="some source" title="sometitle" width="25"/></span>
</a>
</div>
</div>
<div class="addthis_toolbox addthis_default_style">
<a class="btn btn-xs btn-share addthis_button_facebook" href="https://somelink" target="_blank">
<span class="playblk"><img alt="some img" class="playblk" height="25" src="othersource" title="some othertitle" width="25"/></span>
</a>
</div>
<div class="div-share">
<h1>"The Divine Wings Of Tragedy" lyrics</h1></div>,
<div class="pther">
<h2><b>Symphony X Lyrics</b></h2>
</div>
<div class="ringtone">
<span id="cf_text_top"></span>
</div>
<div>
<i>[Part I - At the Four Corners of the Earth]</i>
<br/>
<br/> On the edge of paradise
<br/> Tears of woe fall, cold as ice
<br/> Hear my cry
<br/>
</div>
</body>
</html>
I want to find the only tag that has no attributes at all. Not an empty attribute, as I saw in other questions, nor some unusual specific attribute, nor attrs = None: that tag simply has nothing else. But if I use findAll, I get all the other tags in the HTML as well, and the same happens with attrs = False, attrs = None and so on.
Is there a way to do this?
Thanks a lot!
You can pass a lambda function to the find_all method that checks the tag name and that there are no attrs within the element:
soup.find_all(lambda tag: tag.name == 'div' and not tag.attrs)
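Here is a small self-contained run of that filter. The HTML is a trimmed version of the sample in the question, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Trimmed version of the question's HTML: one div with a class, one with none.
html = (
    '<body>'
    '<div class="ringtone"><span id="cf_text_top"></span></div>'
    '<div><i>[Part I]</i><br/>On the edge of paradise</div>'
    '</body>'
)
soup = BeautifulSoup(html, "html.parser")

# tag.attrs is an empty dict for a bare tag, and an empty dict is falsy.
bare_divs = soup.find_all(lambda tag: tag.name == "div" and not tag.attrs)

lyrics = bare_divs[0].get_text(" ", strip=True)
print(lyrics)
```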
I've been practicing my scraping and everything was going fine but as hard as I try I can't seem to get this specific data I'm looking for.
Structure looks like this
</div>
<div class="col-xs-12 col-sm-12 col-md-7 list-field-wrap">
<div class="pull-left">
<div class="row">
<div class=" list-field type-field" style="width: 45px"><div class="visible-xs-block visible-sm-block list-label">BIB</div>17584</div>
<div class=" list-field type-age_class" style="width: 65px"><div class="visible-xs-block visible-sm-block list-label">Division</div>20-24</div>
</div>
</div>
What I want to do is get the 17584 using class = "visible-xs-block visible-sm-block list-label".
Unfortunately, every time I try to select it, it only returns
<div class="visible-xs-block visible-sm-block list-label">BIB</div>
This is the code I've been using to try to select it:
bib = soup.find('div', class_="visible-xs-block visible-sm-block list-label")
print(bib)
Update: I was able to figure it out; the structure starts earlier.
17584 is not part of the tag with class visible-xs-block visible-sm-block list-label:
<div class=" list-field type-field" style="width: 45px">
<div class="visible-xs-block visible-sm-block list-label">
BIB
</div>
17584
</div>
Try to select list-field type-field instead.
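A rough sketch of that suggestion, using only the one field from the question. Looking up the outer div by the single class type-field and then taking the text node after the label is my assumption about what is wanted; the real page may need a more specific selector.

```python
from bs4 import BeautifulSoup

html = (
    '<div class=" list-field type-field" style="width: 45px">'
    '<div class="visible-xs-block visible-sm-block list-label">BIB</div>'
    '17584'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Select the outer field, which is the element that actually contains 17584.
field = soup.find("div", class_="type-field")

# The number is a bare text node that follows the label div.
label = field.find("div", class_="list-label")
bib = label.next_sibling.strip()
print(bib)
```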
I have run into an issue while working on a web scraping project in Python. I am new to Python and am not sure how to extract a specific line, or a value from part of a line, from the Beautiful Soup output. I would like to get only the data-rarity part from this site, but I haven't found how to do that without removing the entire line from the list.
Any help is much appreciated!
I have this:
rarity = soup.find_all('div', {'class': 'profileCards__card'})
print(rarity[0])
This outputs:
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<img src="//cdn.statsroyale.com/images/cards/full/snowball.png"><span class="profileCards__level">lvl.9</span>
<div class="profileCards__meter">
<span style="width: 100%"></span>
<div class="profileCards__meter__numbers">
8049/800
</div>
</div>
<div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">
Giant Snowball
</div>
</div>
Ideally, I want only the value of data-rarity, so just the 102 from this markup shown in the site's inspector:
<div class="profileCards__cards">
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<img src="//cdn.statsroyale.com/images/cards/full/snowball.png"><span class="profileCards__level">lvl.9</span>
<div class="profileCards__meter">
<span style="width: 100%"></span>
<div class="profileCards__meter__numbers">
8049/800
</div>
</div>
<div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">
Giant Snowball
</div>
</div>
Use:
rarity = soup.find_all('div', {'class': 'profileCards__card'})
for r in rarity:
    # Each r is already a profileCards__card div, so read its attribute directly.
    print(r["data-rarity"])
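As a self-contained illustration, attributes on a tag can be read like dictionary entries. The markup below is cut down from the question, and using .get() instead of square brackets is just my choice so that a card without the attribute gives None rather than raising KeyError.

```python
from bs4 import BeautifulSoup

# Cut-down version of the markup from the question.
html = (
    '<div class="profileCards__cards">'
    '<div class="profileCards__card upgrade " data-level="902" '
    'data-elixir="2" data-rarity="102" data-arena="802">'
    '<span class="profileCards__level">lvl.9</span>'
    '</div>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

cards = soup.find_all("div", class_="profileCards__card")

# Attribute values come back as strings; convert with int() if needed.
rarities = [card.get("data-rarity") for card in cards]
print(rarities)
```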
Can someone help me extract some data from the sample HTML below using Beautiful Soup in Python?
These are what i'm trying to extract:
The href html link : example
/movies/watch-malayalam-movies-online/6106-watch-buddy.html
The alt text which has the movie name :
Buddy 2013 Malayalam Movie
The thumbnail : example http://i44.tinypic.com/2lo14b8.jpg
(There are multiple occurrences of these..)
Full source available at : http://olangal.com
Sample html :
<div class="item column-1">
<h2>
<a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">
Buddy
</a>
</h2>
<ul class="actions">
<li class="email-icon">
<a href="/component/mailto/?tmpl=component&template=beez_20&link=36bbe22fb7c54b5465609b8a2c60d8c8a1841581" title="Email" onclick="window.open(this.href,'win2','width=400,height=350,menubar=yes,resizable=yes'); return false;">
<img src="/media/system/images/emailButton.png" alt="Email" />
</a>
</li>
</ul>
<img width="110" height="105" alt=" Buddy 2013 Malayalam Movie" src="http://i44.tinypic.com/2lo14b8.jpg" border="0" />
<p class="readmore">
<a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">
Read more...
</a>
</p>
<div class="item-separator">
</div>
</div>
<div class="item column-2">
<h2>
<a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">
Pigman
</a>
</h2>
<ul class="actions">
<li class="email-icon">
<a href="/component/mailto/?tmpl=component&template=beez_20&link=2b0dfb09b41b8e6fabfd7ed2a035f4d728bedb1a" title="Email" onclick="window.open(this.href,'win2','width=400,height=350,menubar=yes,resizable=yes'); return false;">
<img src="/media/system/images/emailButton.png" alt="Email" />
</a>
</li>
</ul>
<img width="110" height="105" alt="Pigman 2013 Malayalam Movie" src="http://i41.tinypic.com/jpa3ko.jpg" border="0" />
<p class="readmore">
<a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">
Read more...
</a>
</p>
<div class="item-separator">
</div>
</div>
Update : Finally cracked it with help from #kroolik. Thanks to you.
Here's what worked for me:
for eachItem in soup.find_all("div", {"class": "item"}):
    # Drop the <ul class="actions"> block so its email link and icon are skipped.
    eachItem.ul.decompose()
    imglinks = eachItem.find_all('img')
    for imglink in imglinks:
        imgfullLink = imglink.get('src').strip()
    links = eachItem.find_all('a')
    for link in links:
        names = link.contents[0].strip()
        fullLink = "http://olangal.com" + link.get('href').strip()
        print("Extracted : " + names + " , " + imgfullLink + " , " + fullLink)
You can get both <img width="110"> and <p class="readmore"> using the following:
for div in soup.find_all(class_='item'):
    # Will match `<p class="readmore">...</p>` that is a direct
    # child of the div.
    p = div.find(class_='readmore', recursive=False)
    # Will print the `href` attribute of the first `<a>` element
    # inside `p`.
    print(p.a['href'])
    # Will match `<img width="110">` that is a direct child
    # of the div.
    img = div.find('img', width=110, recursive=False)
    print(img['src'], img['alt'])
Note that this is for the most recent Beautiful Soup version.
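Putting the pieces together, here is a small Python 3 sketch that collects the three values the question asks for into a list of dicts. The HTML is one trimmed item from the sample, and taking the link from the <h2> and the thumbnail from the width="110" image are assumptions based on that sample only.

```python
from bs4 import BeautifulSoup

# One trimmed item from the sample HTML in the question.
html = '''
<div class="item column-1">
  <h2><a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">Buddy</a></h2>
  <ul class="actions"><li class="email-icon">
    <a href="/component/mailto/" title="Email"><img src="/media/system/images/emailButton.png" alt="Email" /></a>
  </li></ul>
  <img width="110" height="105" alt=" Buddy 2013 Malayalam Movie" src="http://i44.tinypic.com/2lo14b8.jpg" border="0" />
  <p class="readmore"><a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">Read more...</a></p>
  <div class="item-separator"></div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

movies = []
for item in soup.find_all("div", class_="item"):
    link = item.h2.a                        # title link inside the heading
    img = item.find("img", width="110")     # thumbnail, not the email icon
    movies.append({
        "name": img["alt"].strip(),
        "href": link["href"],
        "thumb": img["src"],
    })

print(movies)
```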
I usually use PyQuery for this kind of scraping; it's clean and easy, and you can use jQuery selectors directly with it. For example, to see your name and reputation, I would just write something like:
from pyquery import PyQuery as pq
d = pq(url='http://stackoverflow.com/users/1234402/gbzygil')
p = d('#user-displayname')
t = d('#user-panel-reputation div h1 a span')
print(p.html())
So unless you are tied to BeautifulSoup, I would strongly recommend switching to PyQuery or some other library that supports XPath well.