I've got some html where a bit of it looks like
<div class="large-12 columns">
<div class="box">
<div class="header">
<h2>
Line-Ups
</h2>
</div>
<div class="large-6 columns aufstellung-box" style="padding: 0px;">
<div class="unterueberschrift aufstellung-unterueberschrift-mannschaft">
<div>
<a class="vereinprofil_tooltip" href="/fc-portsmouth/startseite/verein/1020/saison_id/2006" id="1020">
<img alt="Portsmouth FC" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/1020_1564722280.png?lm=1564722280" title=" "/>
...........
<div class="large-6 columns" style="padding: 0px;">
<div class="unterueberschrift aufstellung-unterueberschrift-mannschaft aufstellung-bordertop-small">
<div>
<a class="vereinprofil_tooltip" href="/fc-arsenal/startseite/verein/11/saison_id/2006" id="11">
<img alt="Arsenal FC" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/11_1400911988.png?lm=1400911994" title=" "/>
</a>
</div>
<div>
<nobr>
<a class="sb-vereinslink" href="/fc-arsenal/startseite/verein/11/saison_id/2006" id="11">
Arsenal FC
The key here is that <div class="large-6 shows up twice, which is what I'm trying to split on.
The code I'm using is simply boxes = soup.find_All("div",{'class',re.compile(r'^large-6 columns')}), but that returns absolutely nothing.
I've used BeautifulSoup successfully plenty of times before and I'm sure it's something stupid that I'm missing, but I've been banging my head against a wall for the last 2 hours and can't seem to figure it out. Any help would be much appreciated.
Python is case-sensitive, so the method name needs to be soup.find_all rather than find_All. The code below ran against a working URL:
import requests
from bs4 import BeautifulSoup

url = "https://####.###"
r = requests.get(url)
coverpage = r.content
soup = BeautifulSoup(coverpage, 'html5lib')
test = soup.find_all("a")
print(test)
When I changed the all into All, it broke with the following error (presumably because BeautifulSoup treats the unknown attribute name as a tag search, which returns None, and calling None raises the TypeError):
test = soup.find_All("a")
TypeError: 'NoneType' object is not callable
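Besides the capitalisation, note that {'class', re.compile(r'^large-6 columns')} is a set literal (comma), not the attrs dict find_all expects (colon), and matching a multi-valued class attribute is easiest with a CSS selector rather than a regex. A minimal sketch against a cut-down, reconstructed version of the markup above (not the full page):

```python
from bs4 import BeautifulSoup

# Cut-down, reconstructed version of the markup in the question.
html = """
<div class="large-12 columns">
  <div class="large-6 columns aufstellung-box" style="padding: 0px;">Portsmouth FC</div>
  <div class="large-6 columns" style="padding: 0px;">Arsenal FC</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A CSS selector sidesteps both pitfalls: no regex needed, and it matches
# tags carrying both classes regardless of any extra classes.
boxes = soup.select("div.large-6.columns")
print(len(boxes))  # 2
```

soup.select matches a tag as long as it carries both class tokens, so the aufstellung-box variant is picked up as well.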
I have run into an issue while working on a web-scraping project in Python. I am new to Python and am not sure how to extract a specific line, or a value from part of a line, from the BeautifulSoup output. I would like to get only the data-rarity part from this site, but I haven't found a way to do that without removing the entire line from the list.
Any help is much appreciated!
I have this:
rarity = soup.find_all('div', {'class': 'profileCards__card'})
print(rarity[0])
This outputs:
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<img src="//cdn.statsroyale.com/images/cards/full/snowball.png"><span class="profileCards__level">lvl.9</span>
<div class="profileCards__meter">
<span style="width: 100%"></span>
<div class="profileCards__meter__numbers">
8049/800
</div>
</div>
<div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">
Giant Snowball
</div>
</div>
Ideally I would like to get only the value after data-rarity, so just the 102, from this part of the site's markup (taken from inspect element):
<div class="profileCards__cards">
<div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
<img src="//cdn.statsroyale.com/images/cards/full/snowball.png"><span class="profileCards__level">lvl.9</span>
<div class="profileCards__meter">
<span style="width: 100%"></span>
<div class="profileCards__meter__numbers">
8049/800
</div>
</div>
<div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">
Giant Snowball
</div>
</div>
Use:
rarity = soup.find_all('div', {'class': 'profileCards__card'})
for r in rarity:
    # each r is already the matched div, so read the attribute directly
    print(r["data-rarity"])
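Put together, a self-contained version looks like this (the HTML is a cut-down copy of the fragment quoted in the question):

```python
from bs4 import BeautifulSoup

# Cut-down copy of the card markup quoted in the question.
html = '''
<div class="profileCards__cards">
  <div class="profileCards__card upgrade " data-level="902" data-elixir="2" data-rarity="102" data-arena="802">
    <div class="ui__tooltip ui__tooltipTop ui__tooltipMiddle cards__tooltip">Giant Snowball</div>
  </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
for card in soup.find_all('div', {'class': 'profileCards__card'}):
    # the data-rarity attribute sits on the matched tag itself
    print(card['data-rarity'])  # 102
```

Note that the outer profileCards__cards wrapper is not matched, because class matching works on whole tokens.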
I am trying to scrape urls from the html format website. I use beautiful soup. Here's a part of the html.
<li style="display: block;">
<article itemscope itemtype="http://schema.org/Article">
<div class="col-md-3 col-sm-3 col-xs-12" >
<a href="/stroke?p=3083" class="article-image">
<img itemprop="image" src="/FileUploads/Post/3083.jpg?w=300&h=160&mode=crop" alt="Banana" title="Good for health">
</a>
</div>
<div class="col-md-9 col-sm-9 col-xs-12">
<div class="article-content">
<a href="/stroke">
<img src="/assets/home/v2016/img/icon/stroke.png" style="float:left;margin-right:5px;width: 4%;">
</a>
<a href="/stroke?p=3083" class="article-title">
<div>
<h4 itemprop="name" id="playground">
Banana Good for health </h4>
</div>
</a>
<div>
<div class="clear"></div>
<span itemprop="dateCreated" style="font-size:10pt;color:#777;">
<i class="fa fa-clock-o" aria-hidden="true"></i>
09/10 </span>
</div>
<p itemprop="description" class="hidden-phone">
<a href="/stroke?p=3083">
I love Banana.
</a>
</p>
</div>
</div>
</article>
</li>
My code:
import requests
from bs4 import BeautifulSoup

re = requests.get('http://xxxxxx')
bs = BeautifulSoup(re.text.encode('utf-8'), "html.parser")
for link in bs.find_all('a'):
    if link.has_attr('href'):
        print(link.attrs['href'])
The result will print out all the urls from this page, but this is not what I am looking for; I only want a particular one, like "/stroke?p=3083" in this example. How can I set that condition in Python? (I know there are three "/stroke?p=3083" links in total, but I just need one.)
Another question: this url is not complete, I need to combine it with "http://www.abcde.com" so the result will be "http://www.abcde.com/stroke?p=3083". I know I can use paste in R, but how do I do this in Python? Thanks in advance! :)
Just put a link in the scraper, replacing some_link, and give it a go. You should get your desired link in its full form.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
res = requests.get(some_link).text
soup = BeautifulSoup(res,"lxml")
for item in soup.select(".article-image"):
    print(urljoin(some_link, item['href']))
Another question: this url is not complete, I need to combine it with "http://www.abcde.com" so the result will be "http://www.abcde.com/stroke?p=3083". I know I can use paste in R, but how to do this in Python?
link = 'http://abcde.com' + link
You are getting most of it right already. Collect the hrefs as follows (just a list-comprehension version of what you are doing already; note it is find_all, not findall, and you want the href strings rather than the tags):
urls = [url['href'] for url in bs.find_all('a') if url.has_attr('href')]
This will give you the urls. To get one of them, and append it to the abcde url you could simply do the following:
if urls:
    new_url = 'http://www.abcde.com{}'.format(urls[0])
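Both answers can be combined into one pass: filter the hrefs for the pattern you need, deduplicate, and let urljoin build the absolute URL. A sketch, where the base URL and link pattern are the ones from the question and the HTML is a cut-down version of the fragment above:

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Cut-down version of the article fragment from the question.
html = '''
<a href="/stroke?p=3083" class="article-image"></a>
<a href="/stroke"></a>
<a href="/stroke?p=3083" class="article-title"></a>
'''
base = "http://www.abcde.com"
soup = BeautifulSoup(html, "html.parser")

# keep only article links (href contains "?p="); a set drops the duplicates
hrefs = {a['href'] for a in soup.find_all('a', href=True) if '?p=' in a['href']}
for href in hrefs:
    print(urljoin(base, href))  # http://www.abcde.com/stroke?p=3083
```

Using a set means the three identical "/stroke?p=3083" links collapse into one, matching the "I just need one" requirement.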
I need to scrape the code below, to retrieve the portions that say "SCRAPE THIS" and "SCRAPE THIS AS WELL". I have been playing around with it for a few hours with no luck! Does anyone know how this can be done?
<div class="mod-body add-border"> <div class="mod-inline mod-body-A-F"> <h4>SCRAPE THIS</h4> <div class="mod-body"> <ul class="list"> <li>SCRAPE THIS AS WELL</li> </ul> </div> </div>
Try this code:
from bs4 import BeautifulSoup
text = """<div class="mod-body add-border"> <div class="mod-inline mod-body-A-F"> <h4>SCRAPE THIS</h4> <div class="mod-body"> <ul class="list"> <li>SCRAPE THIS AS WELL</li> </ul> </div> </div>"""
x = BeautifulSoup(text, 'lxml')
print(x.find('h4').get_text())
print(x.find('li').get_text())
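On a full page with many h4 and li elements, you would want to scope the search to the container first so stray matches elsewhere are ignored. A sketch, assuming the classes from the fragment (html.parser is used here to avoid the extra lxml dependency):

```python
from bs4 import BeautifulSoup

text = ('<div class="mod-body add-border"> <div class="mod-inline mod-body-A-F"> '
        '<h4>SCRAPE THIS</h4> <div class="mod-body"> <ul class="list"> '
        '<li>SCRAPE THIS AS WELL</li> </ul> </div> </div>')

soup = BeautifulSoup(text, 'html.parser')

# scope the search to the container so <h4>/<li> elsewhere don't match
container = soup.find('div', class_='mod-inline')
print(container.find('h4').get_text())  # SCRAPE THIS
print(container.find('li').get_text())  # SCRAPE THIS AS WELL
```

container.find only searches descendants of the mod-inline div, so this keeps working even if the page has other headings and list items.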
I've run into a problem. My aim is to parse the data up to a certain point and then stop parsing.
<span itemprop="address">
Some address
</span>
<i class="fa fa-signal">
</i>
...
</p>
</div>
</div>
<div class="search_pagination" id="pagination">
<ul class="pagination">
</ul>
</div>
</div>
</div>
</div>
<div class="col-sm-3">
<div class="panel" itemscope="" itemtype="http://schema.org/WPSideBar">
<h2 class="heading_a" itemprop="name">
Top-10 today
</h2> <!-- a lot of tags after this point -->
I want to get all the values from <span itemprop="address"> (there are many of them earlier in the page) up to the moment "Top-10 today" appears.
You can actually let BeautifulSoup parse only the tags you are interested in via SoupStrainer:
from bs4 import BeautifulSoup, SoupStrainer
only_addresses = SoupStrainer("span", itemprop="address")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_addresses)
If, though, you have some "addresses" before the "Top-10 today" heading and some after it, and you are only interested in those coming before, you can write a custom search function:
def search_addresses(tag):
    return tag.name == "span" and tag.get("itemprop") == "address" and \
        tag.find_next("h2", text=lambda text: text and "Top-10 today" in text)
addresses = soup.find_all(search_addresses)
It does not look trivial, but the idea is simple: for every "address" we use find_next() to check whether a "Top-10 today" heading still exists after it.
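On a toy document (markup reconstructed from the fragment above, not the real page) the approach looks like this; note that newer BeautifulSoup versions prefer the string= keyword over text=:

```python
from bs4 import BeautifulSoup

# Toy document reconstructed from the fragment above.
html = """
<span itemprop="address">Address 1</span>
<span itemprop="address">Address 2</span>
<h2 class="heading_a" itemprop="name">Top-10 today</h2>
<span itemprop="address">Address after</span>
"""
soup = BeautifulSoup(html, "html.parser")

def search_addresses(tag):
    # keep only address spans that still have the heading ahead of them
    return (tag.name == "span" and tag.get("itemprop") == "address"
            and tag.find_next("h2", string=lambda t: t and "Top-10 today" in t))

addresses = [t.get_text() for t in soup.find_all(search_addresses)]
print(addresses)  # ['Address 1', 'Address 2']
```

The span after the heading has no matching h2 following it, so find_next returns None and the function filters it out.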
I'm trying to add several tags to div[id='head'] at once in a BeautifulSoup document with
soup.find_all('div', id='head',limit=1)[0].insert(1, soup.new_tag(u'<div id="menu_top_right" class="menu_top"><div class="menu_inner"><a class="" target="_blank" href="./local/zkratky/index.html">Zkratky</a><span>|</span><a class="" target="_blank" href="./local/slovnik/index.html">Slovník</a><span>|</span><a class="" target="blank" href="./local/dokumenty/index.html">Dokumenty</a></div></div><div class="clear"></div>'))
As a result I got the code inserted twice (with some extra < and >), but have no idea why.
<<div id="menu_top_right" class="menu_top">
<div class="menu_inner">
<a class="" target="_blank" href="./local/zkratky/index.html">Zkratky</a><span>|</span>
<a class="" target="_blank" href="./local/slovnik/index.html">Slovník</a><span>|</span>
<a class="" target="blank" href="./local/dokumenty/index.html">Dokumenty</a>
</div>
</div>
<div class="clear"></div>>
</<div id="menu_top_right" class="menu_top">
<div class="menu_inner">
<a class="" target="_blank" href="./local/zkratky/index.html">Zkratky</a><span>|</span>
<a class="" target="_blank" href="./local/slovnik/index.html">Slovník</a><span>|</span>
<a class="" target="blank" href="./local/dokumenty/index.html">Dokumenty</a>
</div>
</div>
<div class="clear"></div>>
I didn't find anything in the docs saying you can't create several new tags with one soup.new_tag(). What might be the problem?
I assume that you are using BeautifulSoup 4? If so, the problem is that you cannot create multiple tags with a single new_tag() call, let alone pass it raw HTML:
>>> soup.new_tag('<div myattr="foo"></div>')
<<div myattr="foo"></div>></<div myattr="foo"></div>>
You must create each child separately and assign the attributes manually:
>>> parent = soup.find('div')
>>> parent
<div></div>
>>> new_tag = soup.new_tag('div')
>>> new_tag['id'] = 'menu_top_right'
>>> new_tag['class'] = 'menu_top'
>>> new_tag
<div class="menu_top" id="menu_top_right"></div>
>>> parent.insert(1, new_tag)
>>> soup
<div><div class="menu_top" id="menu_top_right"></div></div>