Scraping data using span title and span class - python

I'm using Python Anaconda to scrape data into an Excel sheet. I'm running into some trouble with two sites.
Site 1
<div id="ember3815" class="ember-view">
<p class="org-top-card-module__company-descriptions Sans-15px-black-55%">
<span class="company-industries org-top-card-module__dot-separated-list">
Industry
</span>
<span class="org-top-card-module__location org-top-card-module__dot-separated-list">
City, State
</span>
<span title="62,346 followers" class="org-top-card-module__followers-count org-top-card-module__dot-separated-list">
62,346 followers
</span>
I'm trying to pull the span title. Things I've tried (I've also tried them all as find_all):
text = soup.find('span',{'class':"company-industries org-top-card-module__dot-separated-list"})
text = soup.find('p',{'class':"org-top-card-module__company-descriptions Sans-15px-black-55%"})
text = soup.body.find('span', attrs={'class': 'org-top-card-module__location org-top-card-module__dot-separated-list'})
text = soup.find('span',{'class': 'org-top-card-module__location org-top-card-module__dot-separated-list'})
I'm sure there are other things I've tried that aren't listed here, because I don't remember them all. I'm not a programmer; I'm just trying to figure this out to pull data for analysis. Help?
Site 2
I need to pull the value 8,052 from the html below.
<section class="zwlfE">
<div class="nZSzR">...</div>
<ul class="k9GMp ">
<li class="Y8-fY ">...</li>
<li class="Y8-fY ">
<a class="g47SY " title="8,052"><span>8,052</span> followers
</a>
</li>
<li class="Y8-fY ">...</li>
</ul>
<div class="-vDIg">...</div>
</section>
I have tried:
text = soup.find('span',{'class': "g47SY "})
similar to above but with the div and li tags
Everything I've tried results in [].
Please help?

To get the span title:
from bs4 import BeautifulSoup
html ="""<div id="ember3815" class="ember-view">
<p class="org-top-card-module__company-descriptions Sans-15px-black-55%">
<span class="company-industries org-top-card-module__dot-separated-list">
Industry
</span>
<span class="org-top-card-module__location org-top-card-module__dot-separated-list">
City, State
</span>
<span title="62,346 followers" class="org-top-card-module__followers-count org-top-card-module__dot-separated-list">
62,346 followers
</span>"""
soup = BeautifulSoup(html, "html.parser")
print( soup.find("span", class_="org-top-card-module__followers-count org-top-card-module__dot-separated-list")["title"])
Output:
62,346 followers
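If you also need the industry and location text from the same card, a small extension of the snippet above (reusing its soup object; the class strings come from the question's HTML):
# Grab the other two spans from the same card.
industry = soup.find("span", class_="company-industries org-top-card-module__dot-separated-list")
location = soup.find("span", class_="org-top-card-module__location org-top-card-module__dot-separated-list")
print(industry.get_text(strip=True))  # Industry
print(location.get_text(strip=True))  # City, State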
And for site 2:
print( soup.find("a", class_="g47SY")["title"])

Related

How can I parse an html file using Python and Beautiful Soup to get an html tag nested under another html tag?

My html file contains the same tag (<span class="fna">) multiple times. To differentiate between them I need to look at the enclosing tag: I want the one under <span id="field-value-reporter">.
In Beautiful Soup I can only filter on a single tag condition, like soup.find_all("span", {"class": "fna"}). That extracts the data for every <span class="fna"> tag, but I only need the one contained under <span id="field-value-reporter">.
Example html tags:
<div class="value">
<span id="field-value-reporter">
<div class="vcard vcard_287422" >
<a class="email " href="/user_profile?user_id=287422" >
<span class="fna">Chris Pearce (:cpearce)
</span>
</a>
</div>
</span>
</div>
<div class="value">
<span id="field-value-triage_owner">
<div class="vcard vcard_27780" >
<a class="email " href="/user_profile?user_id=27780">
<span class="fna">Justin Dolske [:Dolske]
</span>
</a>
</div>
</span>
</div>
Use soup.select:
soup.select('#field-value-reporter a > span') # every <span> that is a direct child of an <a> inside the tag whose id is field-value-reporter
>>> [<span class="fna">Chris Pearce (:cpearce)</span>]
soup.select uses CSS selectors and is, in my opinion, much more capable than the default element search that comes with BeautifulSoup. Note that the results are returned as a list containing everything that matches.
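As a runnable illustration (using a trimmed-down copy of the HTML above, so treat the exact markup as an assumption), the selector returns only the reporter's span:
from bs4 import BeautifulSoup

html = """<div class="value">
<span id="field-value-reporter">
<a class="email" href="/user_profile?user_id=287422">
<span class="fna">Chris Pearce (:cpearce)</span>
</a>
</span>
</div>
<div class="value">
<span id="field-value-triage_owner">
<a class="email" href="/user_profile?user_id=27780">
<span class="fna">Justin Dolske [:Dolske]</span>
</a>
</span>
</div>"""

soup = BeautifulSoup(html, "html.parser")
# Only the span nested under #field-value-reporter is returned;
# the one under #field-value-triage_owner is filtered out.
for span in soup.select('#field-value-reporter a > span'):
    print(span.get_text(strip=True))  # Chris Pearce (:cpearce)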

Unable to parse div tag using beautiful soup in python?

I am learning to use Beautiful Soup to parse div containers from html. But for some reason, when I pass the class name of the div containers to Beautiful Soup, nothing happens: I get no elements when I try to parse the div. What could I be doing wrong? Here are my html and the parser.
<div class="upcoming-date ng-hide no-league" ng-show="nav.upcoming" ng-class="{'no-league': !search.checkShowTitle(nav.sport,nav.todayHighlights,nav.upcoming,nav.orderBy,"FOOTBALL - HIGHLIGHTS")}">
<span class="weekday">Monday</span>
<timecomponent datetime="'2018-07-09T20:00:00+03:00'" show-date="true" show-time="false" class="date ng-isolate-scope"><span class="ng-binding">09/07/18</span></timecomponent>
<div class="clear"></div>
</div>
<div id="g1390856" class="match football FOOTBALL - HIGHLIGHTS" itemscope="" itemtype="https://schema.org/SportsEvent">
<div class="leaguename ng-hide" ng-show="search.checkShowTitle(nav.sport,nav.todayHighlights,nav.upcoming,nav.orderBy,"FOOTBALL - HIGHLIGHTS") && (1 || (nav.upcoming && 0))">
<span class="name">
<span class="flag-icon flag-icon-swe"></span>
Sweden - Allsvenskan
</span>
</div>
<ul class="meta">
<li class="date">
<timecomponent datetime="'2018-07-09T20:00:00+03:00'" show-date="true" show-time="false" class="ng-isolate-scope"><span class="ng-binding">09/07/18</span></timecomponent>
</li>
<li class="time">
<timecomponent datetime="'2018-07-09T20:00:00+03:00'" show-date="false" show-time="true" class="ng-isolate-scope"><span class="ng-binding">20:00</span></timecomponent>
</li>
<li class="game-id"><span class="gameid">GameID:</span> 2087</li>
</ul>
<ul class="teams">
<li>Hammarby</li>
<li>Ostersunds</li>
</ul>
<ul class="bet-selector">
<li class="pick01" id="b499795664">
<a data-id="499795664" ng-click="bets.pick($event, 499795664, 2087, 2.10)" class="betting-button pick-button " title="Hammarby">
<span class="team">Hammarby</span>
<span class="odd">2.10</span>
</a>
</li> <li class="pick0X" id="b499795666">
<a data-id="499795666" ng-click="bets.pick($event, 499795666, 2087, 3.56)" class="betting-button pick-button " title="Draw">
<span class="team">Draw</span>
<span class="odd">3.56</span>
</a>
</li> <li class="pick02" id="b499795668">
<a data-id="499795668" ng-click="bets.pick($event, 499795668, 2087, 3.40)" class="betting-button pick-button " title="Ostersunds">
<span class="team">Ostersunds</span>
<span class="odd">3.40</span>
</a>
</li> </ul>
<ul class="extra-picks">
<li>
<a class="betting-button " href="/games/1390856/markets?league=0&top=0&sid=2087&sportId=1">
<span class="short-desc">+13</span>
<span class="long-desc">View 13 more markets</span>
</a>
</li>
</ul>
<div class="game-stats">
<img class="img-responsive" src="/img/chart-icon.png?v2.2.25.2">
</div>
<div class="clear"></div>
</div>
.............................................................
parser.py
import requests
import urllib2
from bs4 import BeautifulSoup as soup
udata = urllib2.urlopen('https://www.sportpesa.co.ke/?sportId=1')
htmlsource = udata.read()
ssoup = soup(htmlsource,'html.parser')
page_div = ssoup.findAll("div",{"class":"match football FOOTBALL - HIGHLIGHTS"})
print page_div
"match football FOOTBALL - HIGHLIGHTS" is a dynamic class so you are just getting a blank list....
Here is my code in python3
from bs4 import BeautifulSoup as bs4
import requests
request = requests.get('https://www.sportpesa.co.ke/?sportId=1')
soup = bs4(request.text, 'lxml')
print(soup)
After printing soup you will find that this class is not present in the page source... I hope that it will help you.
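If you want to verify that yourself, a quick check (just a sketch, using the same URL as above) is to search the raw response for the class string the browser shows:
import requests

r = requests.get('https://www.sportpesa.co.ke/?sportId=1')
# If this prints False, the class is injected later by JavaScript,
# so BeautifulSoup will never see it in this response.
print("match football FOOTBALL - HIGHLIGHTS" in r.text)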
So, as suggested in the comment, the best (fastest) way to get data from this site is to make use of the same endpoint that the JavaScript uses.
If you use Chrome, open the developer tools, switch to the Network tab and load the page. You'll see that the site gets the data from a URL that looks very similar to the one displayed in the address bar, namely
https://sportpesa.co.ke/sportgames?sportId=1
This endpoint gives you the data you need. Grabbing it with requests and selecting the divs you want can be done like below:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://sportpesa.co.ke/sportgames?sportId=1")
soup = BeautifulSoup(r.text, "html.parser")
page_divs = soup.select('div.match.football.FOOTBALL.-.HIGHLIGHTS')
print(len(page_divs))
That prints 30, which is the number of divs. By the way, I'm using the bs4 method select here, which is the recommended way of doing things when, as here, you have not just one but multiple classes ('match', 'football', 'FOOTBALL', '-' and 'HIGHLIGHTS').
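Once you have the divs, pulling values out of them follows the structure shown in the question. A minimal sketch that continues from the snippet above (the field names are taken from the posted HTML, so adjust them if the live markup differs):
for div in page_divs:
    # Team names sit in <ul class="teams">, one <li> per team.
    teams = [li.get_text(strip=True) for li in div.select("ul.teams li")]
    # Each pick is an <a class="betting-button"> with a title and a <span class="odd">.
    odds = [(a["title"], a.select_one("span.odd").get_text(strip=True))
            for a in div.select("ul.bet-selector a.betting-button")]
    print(teams, odds)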

Remove html after some point in Beautiful Soup

I'm having some trouble. My aim is to parse the data up to a certain point and then stop parsing.
<span itemprop="address">
Some address
</span>
<i class="fa fa-signal">
</i>
...
</p>
</div>
</div>
<div class="search_pagination" id="pagination">
<ul class="pagination">
</ul>
</div>
</div>
</div>
</div>
<div class="col-sm-3">
<div class="panel" itemscope="" itemtype="http://schema.org/WPSideBar">
<h2 class="heading_a" itemprop="name">
Top-10 today
</h2> #a lot of tags after that moment
I want to get all the values from <span itemprop="address"> (there are a lot of them earlier in the page) up to the point where Top-10 today appears.
You can actually let BeautifulSoup parse only the tags you are interested in via SoupStrainer:
from bs4 import BeautifulSoup, SoupStrainer
only_addresses = SoupStrainer("span", itemprop="address")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_addresses)
If, though, you have some "addresses" before "Top-10 today" and some after, and you are only interested in those coming before it, you can write a custom search function:
def search_addresses(tag):
    return tag.name == "span" and tag.get("itemprop") == "address" and \
        tag.find_next("h2", text=lambda text: text and "Top-10 today" in text)
addresses = soup.find_all(search_addresses)
It does not look trivial, but the idea is simple: for every "address" we use find_next() to check whether the "Top-10 today" heading exists after it.
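Put together, a minimal sketch on a trimmed-down version of the page (the second address below is made up just to show the filtering):
from bs4 import BeautifulSoup

html_doc = """<span itemprop="address">Some address</span>
<h2 class="heading_a" itemprop="name">Top-10 today</h2>
<span itemprop="address">Address after the heading</span>"""

soup = BeautifulSoup(html_doc, "html.parser")

def search_addresses(tag):
    return tag.name == "span" and tag.get("itemprop") == "address" and \
        tag.find_next("h2", text=lambda text: text and "Top-10 today" in text)

# Only the address that comes before the heading survives the filter.
print([s.get_text(strip=True) for s in soup.find_all(search_addresses)])
# ['Some address']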

Web scraping with Beautiful soup multiple tags

I'm trying to get multiple addresses from a web page with an A to Z of links.
First I get A to Z links with:
URL = "http://www.example.com"
html = urlopen(URL).read()
soup = BeautifulSoup(html, "lxml")
content = soup.find("div", "view-content")
links = [BASE_URL + li.a["href"] for li in content.findAll("li")]
This works great and in links above I have a list of links to each individual web page with multiple addresses on each separate page.
For getting the addresses I need I used:
for item in links[0:5]:
    try:
        htmlss = urlopen(item).read()
        soup = bfs(htmlss, "lxml")
        titl = soup.find('div','views-field-title').a.contents
        add = soup.find('div','views-field-address').span.contents
        zipp = soup.find('div','views-field-city-state-zip').span.contents
    except AttributeError:
        continue
The above code will take each link and get the first address on the page with all the A's and the first address on the page with all the B's and so on.
My problem is that on some of the pages there are multiple addresses on each page and the above code only retrieves the first address on that page i.e. First A address first B address and so on.
I've tried using soup.findAll but it doesn't work with a.contents or span.contents.
Basically I need to find the address lines in the html pages with non-unique tags. If I use soup.findAll I get all the content for say (div, views-field-title) which gives me a lot of content I don't need.
Example of some html:
<div class="views-field-nothing-1"></div>
<div class="views-field-nothing">
<span class="field-content">
<div class="views-field-title">
<span class="field-content">
<a href="/golf-courses/details/ca/alameda/chuck-corica-golf-complex-earl-fry">
Chuck Corica Golf Complex, Earl Fry
</a>
</span>
</div>
<div class="views-field-address">
<span class="field-content"></span>
</div>
<div class="views-field-city-state-zip">
<span class="field-content">
Alameda, California 94502-6502
</span>
</div>
</span>
</div>
<div class="views-field-value"></div>
<div class="views-field-nothing-1"></div>
<div class="views-field-nothing">
<span class="field-content">
<div class="views-field-title">
<span class="field-content">
<a href="/golf-courses/details/ca/alameda/chuck-corica-golf-complex-jack-clark">
Chuck Corica Golf Complex, Jack Clark
</a>
</span>
</div>
<div class="views-field-address">
<span class="field-content">
1 Clubhouse Memorial Rd
<br></br>
</span>
</div>
<div class="views-field-city-state-zip">
<span class="field-content">
Alameda, California 94502-6502
</span>
</div>
</span>
</div>
This is just a sample of similar html I need to find data for. Thanks
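A minimal sketch of looping over every address block instead of stopping at the first one, reusing the links list from the first snippet (the class names are taken from the sample HTML above, so adjust them if the real pages differ):
from bs4 import BeautifulSoup as bfs
from urllib.request import urlopen  # urllib2.urlopen on Python 2

for item in links[0:5]:
    soup = bfs(urlopen(item).read(), "lxml")
    # Each address block sits in its own "views-field-nothing" wrapper,
    # so iterate over the wrappers instead of calling find() once per page.
    for block in soup.find_all("div", "views-field-nothing"):
        titl = block.find("div", "views-field-title")
        add = block.find("div", "views-field-address")
        zipp = block.find("div", "views-field-city-state-zip")
        if titl and titl.a:
            print(titl.a.get_text(strip=True),
                  add.span.get_text(strip=True) if add and add.span else "",
                  zipp.span.get_text(strip=True) if zipp and zipp.span else "")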

How to extract href, alt and imgsrc using beautiful soup python

Can someone help me extract some data from the sample html below using Beautiful Soup (Python)?
These are what I'm trying to extract:
The href link, for example:
/movies/watch-malayalam-movies-online/6106-watch-buddy.html
The alt text, which has the movie name:
Buddy 2013 Malayalam Movie
The thumbnail, for example: http://i44.tinypic.com/2lo14b8.jpg
(There are multiple occurrences of these.)
Full source available at: http://olangal.com
Sample html:
<div class="item column-1">
<h2>
<a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">
Buddy
</a>
</h2>
<ul class="actions">
<li class="email-icon">
<a href="/component/mailto/?tmpl=component&template=beez_20&link=36bbe22fb7c54b5465609b8a2c60d8c8a1841581" title="Email" onclick="window.open(this.href,'win2','width=400,height=350,menubar=yes,resizable=yes'); return false;">
<img src="/media/system/images/emailButton.png" alt="Email" />
</a>
</li>
</ul>
<img width="110" height="105" alt=" Buddy 2013 Malayalam Movie" src="http://i44.tinypic.com/2lo14b8.jpg" border="0" />
<p class="readmore">
<a href="/movies/watch-malayalam-movies-online/6106-watch-buddy.html">
Read more...
</a>
</p>
<div class="item-separator">
</div>
</div>
<div class="item column-2">
<h2>
<a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">
Pigman
</a>
</h2>
<ul class="actions">
<li class="email-icon">
<a href="/component/mailto/?tmpl=component&template=beez_20&link=2b0dfb09b41b8e6fabfd7ed2a035f4d728bedb1a" title="Email" onclick="window.open(this.href,'win2','width=400,height=350,menubar=yes,resizable=yes'); return false;">
<img src="/media/system/images/emailButton.png" alt="Email" />
</a>
</li>
</ul>
<img width="110" height="105" alt="Pigman 2013 Malayalam Movie" src="http://i41.tinypic.com/jpa3ko.jpg" border="0" />
<p class="readmore">
<a href="/movies/watch-malayalam-movies-online/6105-watch-pigman.html">
Read more...
</a>
</p>
<div class="item-separator">
</div>
</div>
Update: Finally cracked it with help from #kroolik. Thanks to you.
Here's what worked for me:
for eachItem in soup.findAll("div", {"class": "item"}):
    eachItem.ul.decompose()
    imglinks = eachItem.find_all('img')
    for imglink in imglinks:
        imgfullLink = imglink.get('src').strip()
    links = eachItem.find_all('a')
    for link in links:
        names = link.contents[0].strip()
        fullLink = "http://olangal.com" + link.get('href').strip()
        print "Extracted : " + names + " , " + imgfullLink + " , " + fullLink
You can get both <img width="110"> and <p class="readmore"> using the following:
for div in soup.find_all(class_='item'):
    # Will match `<p class="readmore">...</p>` that is a direct
    # child of the div.
    p = div.find(class_='readmore', recursive=False)
    # Will print the `href` attribute of the first `<a>` element
    # inside `p`.
    print p.a['href']
    # Will match `<img width="110">` that is a direct child
    # of the div.
    img = div.find('img', width='110', recursive=False)
    print img['src'], img['alt']
Note that this is for the most recent Beautiful Soup version.
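For reference, a small Python 3 variant (just a sketch; the html variable below stands for the page source you downloaded) that collects all three fields from each item in one pass:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html = the downloaded page source
for div in soup.find_all("div", class_="item"):
    link = div.find("h2").a               # the movie page link
    thumb = div.find("img", width="110")  # the 110px-wide thumbnail
    print(link["href"], thumb["alt"].strip(), thumb["src"])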
I usually use PyQuery for this kind of scraping; it's clean and easy, and you can use jQuery selectors directly with it. E.g. to see your name and reputation, I would just have to write something like
from pyquery import PyQuery as pq
d = pq(url = 'http://stackoverflow.com/users/1234402/gbzygil')
p=d('#user-displayname')
t=d('#user-panel-reputation div h1 a span')
print p.html()
So unless you can't switch away from BeautifulSoup, I would strongly recommend switching to PyQuery or some other library that supports XPath well.
