BeautifulSoup new_tag inserted twice - python

I'm trying to add several tags to div[id='head'] at once in BeautifulSoup with:
soup.find_all('div', id='head',limit=1)[0].insert(1, soup.new_tag(u'<div id="menu_top_right" class="menu_top"><div class="menu_inner"><a class="" target="_blank" href="./local/zkratky/index.html">Zkratky</a><span>|</span><a class="" target="_blank" href="./local/slovnik/index.html">Slovník</a><span>|</span><a class="" target="blank" href="./local/dokumenty/index.html">Dokumenty</a></div></div><div class="clear"></div>'))
As a result I got the code inserted twice (with some extra < and >), but have no idea why.
<<div id="menu_top_right" class="menu_top">
<div class="menu_inner">
<a class="" target="_blank" href="./local/zkratky/index.html">Zkratky</a><span>|</span>
<a class="" target="_blank" href="./local/slovnik/index.html">Slovník</a><span>|</span>
<a class="" target="blank" href="./local/dokumenty/index.html">Dokumenty</a>
</div>
</div>
<div class="clear"></div>>
</<div id="menu_top_right" class="menu_top">
<div class="menu_inner">
<a class="" target="_blank" href="./local/zkratky/index.html">Zkratky</a><span>|</span>
<a class="" target="_blank" href="./local/slovnik/index.html">Slovník</a><span>|</span>
<a class="" target="blank" href="./local/dokumenty/index.html">Dokumenty</a>
</div>
</div>
<div class="clear"></div>>
I didn't find anything in the docs saying you can't create several new tags with one soup.new_tag(). What might be the problem?

I assume you are using BeautifulSoup4? If so, the problem is that new_tag() takes a tag name, not markup: you cannot create multiple tags with one new_tag() call, let alone pass it HTML:
>>> soup.new_tag('<div myattr="foo"></div>')
<<div myattr="foo"></div>></<div myattr="foo"></div>>
You must create each child separately and assign the attributes manually:
>>> parent = soup.find('div')
>>> parent
<div></div>
>>> new_tag = soup.new_tag('div')
>>> new_tag['id'] = 'menu_top_right'
>>> new_tag['class'] = 'menu_top'
>>> new_tag
<div class="menu_top" id="menu_top_right"></div>
>>> parent.insert(1, new_tag)
>>> soup
<div><div class="menu_top" id="menu_top_right"></div></div>
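Since new_tag() accepts only a tag name, one way to insert a whole block of markup is to parse it as its own fragment and move the resulting nodes into the target. The following is a minimal sketch, assuming the built-in html.parser and a shortened version of the menu markup from the question; the starting document and the variable names are illustrative, not taken from the original page.

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the real page; only div#head matters here.
soup = BeautifulSoup('<div id="head"><h1>Title</h1></div>', "html.parser")
head = soup.find("div", id="head")

# Shortened version of the menu markup from the question.
menu_html = (
    '<div id="menu_top_right" class="menu_top">'
    '<div class="menu_inner">'
    '<a target="_blank" href="./local/zkratky/index.html">Zkratky</a>'
    "</div></div>"
    '<div class="clear"></div>'
)

# Parse the markup separately, then move each top-level node into the target.
fragment = BeautifulSoup(menu_html, "html.parser")
# Copy the list first: inserting a node elsewhere removes it from the fragment.
for offset, node in enumerate(list(fragment.contents)):
    head.insert(1 + offset, node)

print(soup)
```

Parsing the fragment separately keeps the attributes and nesting intact, so nothing has to be rebuilt tag by tag.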

Related

Beautifulsoup find_All command not working

I've got some html where a bit of it looks like
<div class="large-12 columns">
<div class="box">
<div class="header">
<h2>
Line-Ups
</h2>
</div>
<div class="large-6 columns aufstellung-box" style="padding: 0px;">
<div class="unterueberschrift aufstellung-unterueberschrift-mannschaft">
<div>
<a class="vereinprofil_tooltip" href="/fc-portsmouth/startseite/verein/1020/saison_id/2006" id="1020">
<img alt="Portsmouth FC" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/1020_1564722280.png?lm=1564722280" title=" "/>
...........
<div class="large-6 columns" style="padding: 0px;">
<div class="unterueberschrift aufstellung-unterueberschrift-mannschaft aufstellung-bordertop-small">
<div>
<a class="vereinprofil_tooltip" href="/fc-arsenal/startseite/verein/11/saison_id/2006" id="11">
<img alt="Arsenal FC" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/11_1400911988.png?lm=1400911994" title=" "/>
</a>
</div>
<div>
<nobr>
<a class="sb-vereinslink" href="/fc-arsenal/startseite/verein/11/saison_id/2006" id="11">
Arsenal FC
The key here is that <div class="large-6 shows up twice, which is what I'm trying to split on.
The code I'm using is simply boxes = soup.find_All("div",{'class',re.compile(r'^large-6 columns')}) however that is returning absolutely nothing.
I've used BeautifulSoup successfully plenty of times before and I'm sure it's something stupid that I'm missing, but I've been banging my head against a wall for the last 2 hours and can't seem to figure it out. Any help would be much appreciated.
Python is case sensitive, so the method is soup.find_all, not soup.find_All. The code below ran with a working URL.
url = "https://####.###"
import requests
from bs4 import BeautifulSoup
r = requests.get(url)
coverpage = r.content
soup = BeautifulSoup(coverpage, 'html5lib')
test = soup.find_all("a")
print(test)
When I changed all to All, it broke with the following error:
test = soup.find_All("a")
TypeError: 'NoneType' object is not callable
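The error message makes sense once you know that Beautiful Soup treats an unknown attribute name as a search for a child tag with that name: soup.find_All looks for a <find_All> tag, finds none, and returns None, which is then called. The sketch below shows this on a trimmed, made-up version of the question's markup and then selects the two boxes by a single class token. Note also that the question's second argument, {'class', re.compile(...)}, is written with a comma and is therefore a set rather than a dict.

```python
from bs4 import BeautifulSoup

# Trimmed, illustrative version of the markup from the question.
html = """
<div class="large-12 columns">header</div>
<div class="large-6 columns aufstellung-box"><a id="1020">Portsmouth FC</a></div>
<div class="large-6 columns"><a id="11">Arsenal FC</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Unknown attribute names fall back to a tag-name lookup, which finds nothing.
print(soup.find_All)  # prints None

# Matching on one class token selects every div whose class list contains it.
boxes = soup.find_all("div", class_="large-6")
names = [box.a.get_text() for box in boxes]
print(names)
```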

Beautifulsoup, find the only tag in the htm that has no attribute

I know, from the title this question looks the same as thousands of others, but I have searched all the related and similar questions. What I'm asking is, given this HTML (just an example):
<html>
<body>
<div class="div-share noprint">
<div class="addthis_toolbox addthis_default_style">
<a class="btn btn-xs btn-share addthis_button_facebook" href="https://somelink" target="_blank">
<span class="playblk"><img alt="someimg" class="playblk" height="25" src="some source" title="sometitle" width="25"/></span>
</a>
<a class="btn btn-xs btn-share addthis_button_facebook" href="https://somelink" target="_blank">
<span class="playblk"><img alt="someimg" class="playblk" height="25" src="some source" title="sometitle" width="25"/></span>
</a>
</div>
</div>
<div class="addthis_toolbox addthis_default_style">
<a class="btn btn-xs btn-share addthis_button_facebook" href="https://somelink" target="_blank">
<span class="playblk"><img alt="some img" class="playblk" height="25" src="othersource" title="some othertitle" width="25"/></span>
</a>
</div>
<div class="div-share">
<h1>"The Divine Wings Of Tragedy" lyrics</h1></div>,
<div class="pther">
<h2><b>Symphony X Lyrics</b></h2>
</div>
<div class="ringtone">
<span id="cf_text_top"></span>
</div>
<div>
<i>[Part I - At the Four Corners of the Earth]</i>
<br/>
<br/> On the edge of paradise
<br/> Tears of woe fall, cold as ice
<br/> Hear my cry
<br/>
</div>
</body>
</html>
I want to find the only tag that has no attributes. Not an empty attribute, like I saw in other questions, or some specific attribute, or attrs = None: that tag has no attributes at all. But if I use findAll, I get all the other tags in the HTML as well, and the same happens with attrs = False, attrs = None and so on.
Is there a way to do this?
Thanks a lot!
You can pass a lambda function to the find_all method that checks the tag name and that there are no attrs within the element:
soup.find_all(lambda tag: tag.name == 'div' and not tag.attrs)
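As a quick check of the lambda filter, here is a minimal sketch on a cut-down version of the question's HTML (the markup is shortened for illustration). The tag-name test matters because elements such as <i> and <br/> also have no attributes.

```python
from bs4 import BeautifulSoup

# Cut-down version of the HTML from the question.
html = """
<div class="div-share"><h1>"The Divine Wings Of Tragedy" lyrics</h1></div>
<div class="ringtone"><span id="cf_text_top"></span></div>
<div>
<i>[Part I - At the Four Corners of the Earth]</i>
<br/> On the edge of paradise
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# tag.attrs is an empty dict for a tag with no attributes, so `not tag.attrs` is True.
bare_divs = soup.find_all(lambda tag: tag.name == "div" and not tag.attrs)
lyrics = bare_divs[0].get_text(" ", strip=True)
print(lyrics)
```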

Unable to get whole row from BeautifulSoup

I've been practicing my scraping and everything was going fine but as hard as I try I can't seem to get this specific data I'm looking for.
Structure looks like this
</div>
<div class="col-xs-12 col-sm-12 col-md-7 list-field-wrap">
<div class="pull-left">
<div class="row">
<div class=" list-field type-field" style="width: 45px"><div class="visible-xs-block visible-sm-block list-label">BIB</div>17584</div>
<div class=" list-field type-age_class" style="width: 65px"><div class="visible-xs-block visible-sm-block list-label">Division</div>20-24</div>
</div>
</div>
What I want to do is get the 17584 that appears next to the div with class = "visible-xs-block visible-sm-block list-label".
Unfortunately, every time I try to select it, it only returns
<div class="visible-xs-block visible-sm-block list-label">BIB</div>
This is the code I've been using to select it:
bib = soup.find('div', class_="visible-xs-block visible-sm-block list-label")
print(bib)
Update: I was able to figure it out. The structure I needed starts earlier.
17584 is not part of the tag with class visible-xs-block visible-sm-block list-label:
<div class=" list-field type-field" style="width: 45px">
<div class="visible-xs-block visible-sm-block list-label">
BIB
</div>
17584
</div>
Try to select list-field type-field instead.
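A minimal sketch of that suggestion, using the row from the question: select the outer type-field div, then read the text node that follows the label div. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The BIB field from the question's markup.
html = (
    '<div class=" list-field type-field" style="width: 45px">'
    '<div class="visible-xs-block visible-sm-block list-label">BIB</div>'
    "17584</div>"
)
soup = BeautifulSoup(html, "html.parser")

# Select the outer field, which contains both the label and the number.
field = soup.find("div", class_="type-field")
label = field.find("div", class_="list-label")

# The number is the text node that comes right after the label div.
bib = label.next_sibling.strip()
print(bib)
```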

Any way to only extract specific div from beautiful soup

I have run into an issue while working on a web scraping project in Python. I am new to Python and am not sure how to extract a specific line, or a value from part of a line, from the Beautiful Soup output. I would like to get only the data-rarity part from this site, but I haven't found how to do that without removing the entire line from the list.
Any help is much appreciated!
I have this:
rarity = soup.find_all('div', {'class': 'profileCards__card'})
print(rarity[0])
This outputs:
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<img src="//cdn.statsroyale.com/images/cards/full/snowball.png"><span class="profileCards__level">lvl.9</span>
<div class="profileCards__meter">
<span style="width: 100%"></span>
<div class="profileCards__meter__numbers">
8049/800
</div>
</div>
<div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">
Giant Snowball
</div>
</div>
Ideally I want only the value of data-rarity, so just the 102 from this markup, as shown in the site's inspect element view:
<div class="profileCards__cards">
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<img src="//cdn.statsroyale.com/images/cards/full/snowball.png"><span class="profileCards__level">lvl.9</span>
<div class="profileCards__meter">
<span style="width: 100%"></span>
<div class="profileCards__meter__numbers">
8049/800
</div>
</div>
<div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">
Giant Snowball
</div>
</div>
Each element returned by find_all is already the card div, so read the attribute from it directly:
rarity = soup.find_all('div', {'class': 'profileCards__card'})
for r in rarity:
    print(r["data-rarity"])
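For a self-contained check, here is a sketch on a shortened copy of the card markup from the question. It reads the attribute with .get(), which returns None instead of raising KeyError when a card has no data-rarity; that choice is an assumption about what is convenient, not something the question requires.

```python
from bs4 import BeautifulSoup

# Shortened copy of the card markup from the question.
html = """
<div class="profileCards__cards">
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<span class="profileCards__level">lvl.9</span>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches a whole class token, so the profileCards__cards wrapper is skipped.
cards = soup.find_all("div", class_="profileCards__card")

# Tag attributes behave like a dict; .get() avoids a KeyError on missing keys.
rarities = [card.get("data-rarity") for card in cards]
print(rarities)
```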

How to extract href, alt and imgsrc using beautiful soup python

Can someone help me extract some data from the sample HTML below using Beautiful Soup in Python?
This is what I'm trying to extract:
The href html link : example
/movies/watch-malayalam-movies-online/6106-watch-buddy.html
The alt text which has the movie name :
Buddy 2013 Malayalam Movie
The thumbnail : example http://i44.tinypic.com/2lo14b8.jpg
(There are multiple occurrences of these..)
Full source available at : http://olangal.com
Sample html :
<div class="item column-1">
<h2>
<a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">
Buddy
</a>
</h2>
<ul class="actions">
<li class="email-icon">
<a href="/component/mailto/?tmpl=component&template=beez_20&link=36bbe22fb7c54b5465609b8a2c60d8c8a1841581" title="Email" onclick="window.open(this.href,'win2','width=400,height=350,menubar=yes,resizable=yes'); return false;">
<img src="/media/system/images/emailButton.png" alt="Email" />
</a>
</li>
</ul>
<img width="110" height="105" alt=" Buddy 2013 Malayalam Movie" src="http://i44.tinypic.com/2lo14b8.jpg" border="0" />
<p class="readmore">
<a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">
Read more...
</a>
</p>
<div class="item-separator">
</div>
</div>
<div class="item column-2">
<h2>
<a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">
Pigman
</a>
</h2>
<ul class="actions">
<li class="email-icon">
<a href="/component/mailto/?tmpl=component&template=beez_20&link=2b0dfb09b41b8e6fabfd7ed2a035f4d728bedb1a" title="Email" onclick="window.open(this.href,'win2','width=400,height=350,menubar=yes,resizable=yes'); return false;">
<img src="/media/system/images/emailButton.png" alt="Email" />
</a>
</li>
</ul>
<img width="110" height="105" alt="Pigman 2013 Malayalam Movie" src="http://i41.tinypic.com/jpa3ko.jpg" border="0" />
<p class="readmore">
<a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">
Read more...
</a>
</p>
<div class="item-separator">
</div>
</div>
Update : Finally cracked it with help from @kroolik. Thank you.
Here's what worked for me:
for eachItem in soup.find_all("div", {"class": "item"}):
    # Drop the <ul class="actions"> block so its email link and icon are skipped.
    eachItem.ul.decompose()
    imglinks = eachItem.find_all('img')
    for imglink in imglinks:
        imgfullLink = imglink.get('src').strip()
    links = eachItem.find_all('a')
    for link in links:
        names = link.contents[0].strip()
        fullLink = "http://olangal.com" + link.get('href').strip()
        print("Extracted : " + names + " , " + imgfullLink + " , " + fullLink)
You can get both <img width="110"> and <p class="readmore"> using the following:
for div in soup.find_all(class_='item'):
    # Will match `<p class="readmore">...</p>` that is direct
    # child of the div.
    p = div.find(class_='readmore', recursive=False)
    # Will print `href` attribute of the first `<a>` element
    # inside `p`.
    print(p.a['href'])
    # Will match `<img width="110">` that is direct child
    # of the div.
    img = div.find('img', width=110, recursive=False)
    print(img['src'], img['alt'])
Note that this requires Beautiful Soup 4: find_all and the class_ keyword are not available in Beautiful Soup 3.
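Putting the pieces together, the sketch below collects the link, title and thumbnail of each item into a list of dicts. It runs on one item from the question's sample HTML, with the mailto link shortened; the dict keys and variable names are illustrative, and the base URL is the one used in the asker's own code.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://olangal.com"  # site named in the question

# One item from the question's sample HTML (mailto link shortened).
html = """
<div class="item column-1">
<h2><a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">Buddy</a></h2>
<ul class="actions"><li class="email-icon">
<a href="/component/mailto/"><img src="/media/system/images/emailButton.png" alt="Email" /></a>
</li></ul>
<img width="110" height="105" alt=" Buddy 2013 Malayalam Movie" src="http://i44.tinypic.com/2lo14b8.jpg" border="0" />
<p class="readmore"><a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">Read more...</a></p>
<div class="item-separator"></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

movies = []
for item in soup.find_all("div", class_="item"):
    # recursive=False limits the search to direct children of the item,
    # which skips the email icon nested inside the <ul>.
    link = item.find("p", class_="readmore", recursive=False).a
    thumb = item.find("img", recursive=False)
    movies.append({
        "url": urljoin(BASE_URL, link["href"]),
        "title": thumb["alt"].strip(),
        "thumbnail": thumb["src"],
    })

print(movies)
```

Because class_="item" matches a whole class token, the nested item-separator div is not picked up as an item.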
I usually use PyQuery for this kind of scraping; it's clean and easy, and you can use jQuery selectors directly with it. For example, to see your name and reputation, I would just write something like
from pyquery import PyQuery as pq
d = pq(url = 'http://stackoverflow.com/users/1234402/gbzygil')
p=d('#user-displayname')
t=d('#user-panel-reputation div h1 a span')
print(p.html())
So unless you are unable to switch away from Beautiful Soup, I strongly recommend PyQuery or another library that supports XPath well.
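If switching libraries is not an option, Beautiful Soup 4 also accepts CSS selectors through its select() method, which gives a similar selector-based style. This is a minimal sketch on made-up markup shaped like the question's items; the URLs and variable names are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up markup in the shape of the question's items; URLs are placeholders.
html = (
    '<div class="item column-1">'
    '<h2><a href="/a.html">Buddy</a></h2>'
    '<img alt="Buddy" src="http://example.com/a.jpg"/>'
    '<p class="readmore"><a href="/a.html">Read more...</a></p>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: links inside the "readmore" paragraph of each item.
hrefs = [a["href"] for a in soup.select("div.item p.readmore a")]

# Child combinator: only images that are direct children of an item.
thumbs = [img["src"] for img in soup.select("div.item > img")]

print(hrefs, thumbs)
```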
