Python Beautiful Soup to scrape URLs from a web page

I am trying to scrape URLs from an HTML page using Beautiful Soup. Here's part of the HTML.
<li style="display: block;">
<article itemscope itemtype="http://schema.org/Article">
<div class="col-md-3 col-sm-3 col-xs-12" >
<a href="/stroke?p=3083" class="article-image">
<img itemprop="image" src="/FileUploads/Post/3083.jpg?w=300&h=160&mode=crop" alt="Banana" title="Good for health">
</a>
</div>
<div class="col-md-9 col-sm-9 col-xs-12">
<div class="article-content">
<a href="/stroke">
<img src="/assets/home/v2016/img/icon/stroke.png" style="float:left;margin-right:5px;width: 4%;">
</a>
<a href="/stroke?p=3083" class="article-title">
<div>
<h4 itemprop="name" id="playground">
Banana Good for health </h4>
</div>
</a>
<div>
<div class="clear"></div>
<span itemprop="dateCreated" style="font-size:10pt;color:#777;">
<i class="fa fa-clock-o" aria-hidden="true"></i>
09/10 </span>
</div>
<p itemprop="description" class="hidden-phone">
<a href="/stroke?p=3083">
I love Banana.
</a>
</p>
</div>
</div>
</article>
</li>
My code:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://xxxxxx')
bs = BeautifulSoup(r.text, "html.parser")
for link in bs.find_all('a'):
    if link.has_attr('href'):
        print(link.attrs['href'])
This prints out all the URLs on the page, but that is not what I am looking for: I only want a particular one, like "/stroke?p=3083" in this example. How can I set that condition in Python? (I know there are three occurrences of "/stroke?p=3083" here, but I just need one.)
Another question: this URL is not complete, so I need to combine it with "http://www.abcde.com" so the result is "http://www.abcde.com/stroke?p=3083". I know I can use paste in R, but how do I do this in Python? Thanks in advance! :)

Just put a link in the scraper, replacing some_link, and give it a go. You should get your desired link in its full form.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

res = requests.get(some_link).text
soup = BeautifulSoup(res, "lxml")
for item in soup.select(".article-image"):
    print(urljoin(some_link, item['href']))

Regarding the second question (turning "/stroke?p=3083" into a full URL like "http://www.abcde.com/stroke?p=3083"): in Python you can simply concatenate the strings:
link = 'http://www.abcde.com' + link
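Plain concatenation works for this page, but urljoin from the standard library (the first answer above uses it too) is a bit more robust, since it also handles trailing slashes and already-absolute hrefs correctly:

from urllib.parse import urljoin

# the base url and the path are the ones given in the question
print(urljoin('http://www.abcde.com', '/stroke?p=3083'))
# -> http://www.abcde.com/stroke?p=3083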

You are getting most of it right already. Collect the links as follows (just a list comprehension version of what you are doing already, pulling out the href values):
urls = [link['href'] for link in bs.find_all('a') if link.has_attr('href')]
This will give you the URLs as strings. To take one of them and append it to the abcde URL, you could simply do the following:
if urls:
    new_url = 'http://www.abcde.com{}'.format(urls[0])
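Putting it together, here is a sketch of how you might keep only links of the form in the question and drop the duplicates (the "/stroke?p=" prefix check is an assumption based on the example URL):

import requests
from bs4 import BeautifulSoup

base = 'http://www.abcde.com'   # base url from the question
bs = BeautifulSoup(requests.get(base).text, 'html.parser')

# href=True keeps only anchors that actually have an href attribute;
# the startswith filter keeps article links like "/stroke?p=3083"
hrefs = {a['href'] for a in bs.find_all('a', href=True)
         if a['href'].startswith('/stroke?p=')}

for href in hrefs:              # the set removes the duplicate occurrences
    print(base + href)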

Related

Unable to parse div tag using beautiful soup in python?

I am learning to use Beautiful Soup to parse div containers from HTML. But for some reason, when I pass the class name of the div containers to Beautiful Soup, nothing happens: I get no elements when I try to parse the divs. What could I be doing wrong? Here is my HTML and the parser.
<div class="upcoming-date ng-hide no-league" ng-show="nav.upcoming" ng-class="{'no-league': !search.checkShowTitle(nav.sport,nav.todayHighlights,nav.upcoming,nav.orderBy,"FOOTBALL - HIGHLIGHTS")}">
<span class="weekday">Monday</span>
<timecomponent datetime="'2018-07-09T20:00:00+03:00'" show-date="true" show-time="false" class="date ng-isolate-scope"><span class="ng-binding">09/07/18</span></timecomponent>
<div class="clear"></div>
</div>
<div id="g1390856" class="match football FOOTBALL - HIGHLIGHTS" itemscope="" itemtype="https://schema.org/SportsEvent">
<div class="leaguename ng-hide" ng-show="search.checkShowTitle(nav.sport,nav.todayHighlights,nav.upcoming,nav.orderBy,"FOOTBALL - HIGHLIGHTS") && (1 || (nav.upcoming && 0))">
<span class="name">
<span class="flag-icon flag-icon-swe"></span>
Sweden - Allsvenskan
</span>
</div>
<ul class="meta">
<li class="date">
<timecomponent datetime="'2018-07-09T20:00:00+03:00'" show-date="true" show-time="false" class="ng-isolate-scope"><span class="ng-binding">09/07/18</span></timecomponent>
</li>
<li class="time">
<timecomponent datetime="'2018-07-09T20:00:00+03:00'" show-date="false" show-time="true" class="ng-isolate-scope"><span class="ng-binding">20:00</span></timecomponent>
</li>
<li class="game-id"><span class="gameid">GameID:</span> 2087</li>
</ul>
<ul class="teams">
<li>Hammarby</li>
<li>Ostersunds</li>
</ul>
<ul class="bet-selector">
<li class="pick01" id="b499795664">
<a data-id="499795664" ng-click="bets.pick($event, 499795664, 2087, 2.10)" class="betting-button pick-button " title="Hammarby">
<span class="team">Hammarby</span>
<span class="odd">2.10</span>
</a>
</li> <li class="pick0X" id="b499795666">
<a data-id="499795666" ng-click="bets.pick($event, 499795666, 2087, 3.56)" class="betting-button pick-button " title="Draw">
<span class="team">Draw</span>
<span class="odd">3.56</span>
</a>
</li> <li class="pick02" id="b499795668">
<a data-id="499795668" ng-click="bets.pick($event, 499795668, 2087, 3.40)" class="betting-button pick-button " title="Ostersunds">
<span class="team">Ostersunds</span>
<span class="odd">3.40</span>
</a>
</li> </ul>
<ul class="extra-picks">
<li>
<a class="betting-button " href="/games/1390856/markets?league=0&top=0&sid=2087&sportId=1">
<span class="short-desc">+13</span>
<span class="long-desc">View 13 more markets</span>
</a>
</li>
</ul>
<div class="game-stats">
<img class="img-responsive" src="/img/chart-icon.png?v2.2.25.2">
</div>
<div class="clear"></div>
</div>
.............................................................
parser.py
import urllib2
from bs4 import BeautifulSoup as soup

udata = urllib2.urlopen('https://www.sportpesa.co.ke/?sportId=1')
htmlsource = udata.read()
ssoup = soup(htmlsource, 'html.parser')
page_div = ssoup.findAll("div", {"class": "match football FOOTBALL - HIGHLIGHTS"})
print page_div
"match football FOOTBALL - HIGHLIGHTS" is a dynamic class so you are just getting a blank list....
Here is my code in python3
from bs4 import BeautifulSoup as bs4
import requests
request = requests.get('https://www.sportpesa.co.ke/?sportId=1')
soup = bs4(request.text, 'lxml')
print(soup)
After printing soup, you will find that this class is not present in the page source. I hope that helps.
So, as suggested in the comment, the best (fastest) way to get data from this site is to use the same endpoint that the JavaScript uses.
If you use Chrome, open the developer tools, go to the Network tab and load the page. You'll see that the site gets its data from a URL that looks very similar to the one displayed in the address bar, namely
https://sportpesa.co.ke/sportgames?sportId=1
This endpoint gives you the data you need. Grabbing it with requests and selecting the divs you want looks like this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://sportpesa.co.ke/sportgames?sportId=1")
soup = BeautifulSoup(r.text, "html.parser")
page_divs = soup.select('div.match.football.FOOTBALL.-.HIGHLIGHTS')
print(len(page_divs))
That prints 30, which is the number of divs. By the way, I'm using the bs4 method select here, which is the recommended way in bs4 when, as here, you have not just one but multiple classes ('match', 'football', 'FOOTBALL', '-' and 'HIGHLIGHTS').
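Once you have the divs, pulling individual fields out of each one follows the markup shown in the question (team names live under ul.teams, the game id in li.game-id). A small sketch, assuming that structure holds for all 30 divs:

for div in page_divs:
    # each div holds two team names and a "GameID: <n>" label
    teams = [li.get_text(strip=True) for li in div.select('ul.teams li')]
    game_id = div.select_one('li.game-id').get_text(strip=True)
    print(game_id, ' vs '.join(teams))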

how to get text from span in python using scrapy?

I'm placing the HTML code here:
<div class="rendering rendering_person rendering_short rendering_person_short">
<h3 class="title">
<a rel="Person" href="https://moh-it.pure.elsevier.com/en/persons/massimo-eraldo-abate" class="link person"><span>Massimo Eraldo Abate</span></a>
</h3>
<ul class="relations email">
<li class="email"><span>massimo.abate#ior.it</span></li>
</ul>
<p class="type"><span class="family">Person: </span>Academic</p>
</div>
From the above code, how do I extract "Massimo Eraldo Abate"?
Please help me.
You can extract the name using
response.xpath('//h3[@class="title"]/a/span/text()').extract_first()
Also, look at Scrapinghub's blog post for an introduction to XPath.
Please take a look at this page; there are lots of ways of extracting text:
scrapy docs
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
>>> response = HtmlResponse(url='http://example.com', body=body)
>>> Selector(response=response).xpath('//span/text()').extract()
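Applied to the HTML in the question, a minimal self-contained sketch (imports included; the markup is trimmed from the snippet above):

from scrapy.selector import Selector

html = '''
<h3 class="title">
<a rel="Person" href="https://moh-it.pure.elsevier.com/en/persons/massimo-eraldo-abate" class="link person"><span>Massimo Eraldo Abate</span></a>
</h3>
'''

# the same XPath as in the answer above, run outside a spider for quick testing
print(Selector(text=html).xpath('//h3[@class="title"]/a/span/text()').extract_first())
# -> Massimo Eraldo Abate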

Web Scraping with BeautifulSoup -- Python

I need to scrape the code below to retrieve the portions that say "SCRAPE THIS" and "SCRAPE THIS AS WELL". I have been playing around with it for a few hours with no luck! Does anyone know how this can be done?
<div class="mod-body add-border"> <div class="mod-inline mod-body-A-F"> <h4>SCRAPE THIS</h4> <div class="mod-body"> <ul class="list"> <li>SCRAPE THIS AS WELL</li> </ul> </div> </div>
try this code:
from bs4 import BeautifulSoup
text = """<div class="mod-body add-border"> <div class="mod-inline mod-body-A-F"> <h4>SCRAPE THIS</h4> <div class="mod-body"> <ul class="list"> <li>SCRAPE THIS AS WELL</li> </ul> </div> </div>"""
x = BeautifulSoup(text, 'lxml')
print(x.find('h4').get_text())
print(x.find('li').get_text())
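If the page repeats this block many times and you want each heading paired with its list items, one way (a sketch, assuming every block is wrapped in a div with class mod-inline as in the snippet) is to scope the lookups:

for block in x.find_all('div', class_='mod-inline'):
    # heading and list items are looked up inside each block, so pairs stay together
    heading = block.find('h4').get_text(strip=True)
    items = [li.get_text(strip=True) for li in block.find_all('li')]
    print(heading, items)
# SCRAPE THIS ['SCRAPE THIS AS WELL']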

Remove html after some point in Beautiful Soup

I am having trouble. My aim is to parse the data up to a certain point and then stop parsing.
<span itemprop="address">
Some address
</span>
<i class="fa fa-signal">
</i>
...
</p>
</div>
</div>
<div class="search_pagination" id="pagination">
<ul class="pagination">
</ul>
</div>
</div>
</div>
</div>
<div class="col-sm-3">
<div class="panel" itemscope="" itemtype="http://schema.org/WPSideBar">
<h2 class="heading_a" itemprop="name">
Top-10 today
</h2> <!-- a lot of tags after this point -->
I want to get all the values from <span itemprop="address"> (there are a lot of them before this point), stopping at the "Top-10 today" heading.
You can actually let BeautifulSoup parse only the tags you are interested in via SoupStrainer:
from bs4 import BeautifulSoup, SoupStrainer
only_addresses = SoupStrainer("span", itemprop="address")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_addresses)
If, though, you have some "addresses" before "Top-10 today" and some after, and you are interested only in those coming before it, you can make a custom search function:
def search_addresses(tag):
    return tag.name == "span" and tag.get("itemprop") == "address" and \
           tag.find_next("h2", text=lambda text: text and "Top-10 today" in text)

addresses = soup.find_all(search_addresses)
It does not look trivial, but the idea is simple: we use find_next() on every "address" to check whether the "Top-10 today" heading exists after it.
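A runnable sketch tying it together, with a hypothetical html_doc standing in for the real page:

from bs4 import BeautifulSoup

# hypothetical stand-in for the real page
html_doc = '''
<span itemprop="address">Address one</span>
<span itemprop="address">Address two</span>
<h2 class="heading_a" itemprop="name">Top-10 today</h2>
<span itemprop="address">Address after the heading</span>
'''

soup = BeautifulSoup(html_doc, "html.parser")

def search_addresses(tag):
    # keep only address spans that still have the heading ahead of them
    return tag.name == "span" and tag.get("itemprop") == "address" and \
           tag.find_next("h2", text=lambda text: text and "Top-10 today" in text)

print([tag.get_text(strip=True) for tag in soup.find_all(search_addresses)])
# ['Address one', 'Address two'] - the span after the heading is filtered out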

use beautiful soup to parse a href from given html structure

I have the following HTML structure
<li class="g">
<div class="vsc">
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
</h3>
</div>
</li>
The above HTML structure keeps repeating. What is the easiest way to parse all the links (stackoverflow.com) from it using BeautifulSoup and Python?
BeautifulSoup 4 offers a convenient way of accomplishing this, using CSS selectors:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
print([a["href"] for a in soup.select('h3.r a')])
This also has the advantage of constraining the selection by context: it selects only those anchor nodes that are descendants of an h3 node with class r.
Omitting the constraint, or choosing one better suited to your needs, is just a matter of tweaking the selector; see the CSS selector docs for that.
Using CSS selectors as proposed by Petri is probably the best way to do it with BeautifulSoup. However, I can't resist recommending lxml.html and XPath, which are pretty much perfect for the job.
Test html:
html="""
<html>
<li class="g">
<div class="vsc"></div>
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
</h3>
</li>
<li class="g">
<div class="vsc"></div>
<div class="alpha"></div>
<div class="beta"></div>
<h3 class="r">
</h3>
</li>
<li class="g">
<div class="vsc"></div>
<div class="gamma"></div>
<div class="beta"></div>
<h3 class="r">
</h3>
</li>
</html>"""
and it's basically a one-liner:
import lxml.html as lh

doc = lh.fromstring(html)
doc.xpath('.//li[@class="g"][div/@class = "vsc"][div/@class = "alpha"][div/@class = "beta"][h3/@class = "r"]/h3/a/@href')
Out[264]:
['http://www.correct.com', 'http://www.correct.com']
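For completeness, a hedged BeautifulSoup equivalent: soupsieve (the selector engine bundled with bs4 4.7+) supports :has(), so the same structural constraints can be expressed as a CSS selector.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# :has() requires each li.g to contain the expected child divs,
# mirroring the predicates in the XPath above
selector = 'li.g:has(> div.vsc):has(> div.alpha):has(> div.beta) h3.r a'
print([a["href"] for a in soup.select(selector)])
# ['http://www.correct.com', 'http://www.correct.com']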

Categories

Resources