I want to scrape data using a Python script - python

I have written a Python script to scrape data from http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings
It is a list of 100 players, and I scraped this data successfully. The problem is that when I run the script, it scrapes the same data three times instead of just once. Here is the markup for one player:
<div class="cb-col cb-col-100 cb-font-14 cb-lst-itm text-center">
<div class="cb-col cb-col-16 cb-rank-tbl cb-font-16">1</div>
<div class="cb-col cb-col-50 cb-lst-itm-sm text-left">
<div class="cb-col cb-col-33">
<div class="cb-col cb-col-50">
<span class=" cb-ico" style="position:absolute;"></span> –
</div>
<div class="cb-col cb-col-50">
<img src="http://i.cricketcb.com/i/stats/fw/50x50/img/faceImages/2250.jpg" class="img-responsive cb-rank-plyr-img">
</div>
</div>
<div class="cb-col cb-col-67 cb-rank-plyr">
<a class="text-hvr-underline text-bold cb-font-16" href="/profiles/2250/steven-smith" title="Steven Smith's Profile">Steven Smith</a>
<div class="cb-font-12 text-gray">AUSTRALIA</div>
</div>
</div>
<div class="cb-col cb-col-17 cb-rank-tbl">906</div>
<div class="cb-col cb-col-17 cb-rank-tbl">1</div>
</div>
And here is the Python script I wrote to scrape each player's data:
import requests
from bs4 import BeautifulSoup
url = "http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
# every div with class "text-center" anywhere on the page
maindiv = soup.find_all("div", {"class": "text-center"})
for div in maindiv:
    print(div.text)
But instead of scraping the data once, it scrapes the same data three times.
What should I change to get the data just once?

Select the table and look for the divs inside it:
maindiv = soup.select("#batsmen-tests div.text-center")
for div in maindiv:
    print(div.text)
Both your original output and the output above put all the text from each div on one line, which is not very useful. If you just want the player names:
anchors = soup.select("#batsmen-tests div.cb-rank-plyr a")
for a in anchors:
    print(a.text)
A quick and easy way to get the data in a CSV-like format is to take the direct text of each descendant tag:
maindiv = soup.select("#batsmen-tests div.text-center")
for d in maindiv[1:]:
    # take the text directly inside each descendant tag, skipping tags that have none
    row_data = ",".join(s.strip() for s in filter(None, (t.find(text=True, recursive=False) for t in d.find_all())))
    if row_data:
        print(row_data)
Now you get output like:
# rank, up/down, name, country, rating, best rank
1,–,Steven Smith,AUSTRALIA,906,1
2,–,Joe Root,ENGLAND,878,1
3,–,Kane Williamson,NEW ZEALAND,876,1
4,–,Hashim Amla,SOUTH AFRICA,847,1
5,–,Younis Khan,PAKISTAN,845,1
6,–,Adam Voges,AUSTRALIA,802,5
7,–,AB de Villiers,SOUTH AFRICA,802,1
8,–,Ajinkya Rahane,INDIA,785,8
9,2,David Warner,AUSTRALIA,772,3
10,–,Alastair Cook,ENGLAND,770,2
11,1,Misbah-ul-Haq,PAKISTAN,764,6
As opposed to:
PositionPlayerRatingBest Rank
Player
1    –Steven SmithAUSTRALIA9061
2    –Joe RootENGLAND8781
3    –Kane WilliamsonNEW ZEALAND8761
4    –Hashim AmlaSOUTH AFRICA8471
5    –Younis KhanPAKISTAN8451
6    –Adam VogesAUSTRALIA8025

The reason you get the output three times is that the page has three categories. Select the one you want first, then search inside it.
The simplest way to do that with your code is to add one line:
import requests
from bs4 import BeautifulSoup
url = "http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
# the added line: narrow the search to one category container
specific_div = soup.find_all("div", {"id": "batsmen-tests"})
maindiv = specific_div[0].find_all("div", {"class": "text-center"})
for div in maindiv:
    print(div.text)
This gives similar results, but only for Test batsmen. For the other categories, change the "id" in the specific_div line.
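To see why scoping the search removes the repetition, here is a minimal offline sketch. The HTML is a made-up, heavily simplified stand-in for the rankings page: only the id "batsmen-tests" comes from the answers above, while "batsmen-odis", "batsmen-t20s" and the row text are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: one container per category.
# Only "batsmen-tests" is taken from the thread; the other ids are assumed.
html = """
<div id="batsmen-tests"><div class="text-center">Test row</div></div>
<div id="batsmen-odis"><div class="text-center">ODI row</div></div>
<div id="batsmen-t20s"><div class="text-center">T20 row</div></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Searching the whole document matches the rows of every category.
all_rows = [d.get_text(strip=True) for d in soup.find_all("div", class_="text-center")]

# Searching inside one container returns only that category's rows.
container = soup.find("div", id="batsmen-tests")
test_rows = [d.get_text(strip=True) for d in container.find_all("div", class_="text-center")]

print(all_rows)
print(test_rows)
```

The first list contains a row from each container, the second only the row inside "batsmen-tests".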

Related

Beautifulsoup get specific instance of class

This is my first time using BeautifulSoup.
I am trying to scrape a value from a website with the following structure:
<div class="overview">
<i class="fa fa-instagram"></i>
<div class="overflow-h">
<small>Value #1 here</small>
<small>131,390,555</small>
<div class="progress progress-u progress-xxs">
<div style="width: 13%" aria-valuemax="100" aria-valuemin="0" aria-valuenow="92" role="progressbar" class="progress-bar progress-bar-u">
</div>
</div>
</div>
</div>
<div class="overview">
<i class="fa fa-facebook"></i>
<div class="overflow-h">
<small>Value #2 here</small>
<small>555</small>
<div class="progress progress-u progress-xxs">
<div style="width: 13%" aria-valuemax="100" aria-valuemin="0" aria-valuenow="92" role="progressbar" class="progress-bar progress-bar-u">
</div>
</div>
</div>
</div>
I want the second <small>131,390,555</small> in the first <div class="overview"></div>
This is the code I am trying to use:
# Get the hashtag popularity and add it to a dictionary
for hashtag in hashtags:
    popularity = []
    url = ('http://url.com/hashtag/'+hashtag)
    r = requests.get(url, headers=headers)
    if (r.status_code == 200):
        soup = BeautifulSoup(r.content, 'html5lib')
        overview = soup.findAll('div', attrs={"class":"overview"})
        print(overview)
        for small in overview:
            popularity.append(int(small.findAll('small')[1].text.replace(',','')))
        if popularity:
            raw[hashtag] = popularity[0]
            #print(popularity[0])
        print(raw)
        time.sleep(2)
    else:
        continue
The code works as long as the second <small>-value is populated in both div-overviews. I really only need the second small-value from the first overview-div.
I have tried to get it like this:
overview = soup.findAll('div', attrs={"class":"overview"})[0]
But I only get this error:
self.__class__.__name__, attr))
AttributeError: 'NavigableString' object has no attribute 'findAll'
Also, is there a way to keep the script from breaking if there is no small value at all? (The script should just replace the empty value with a zero and continue.)
You can use an index, but I suggest using a CSS selector with nth-child():
soup = BeautifulSoup(html, 'html.parser')
# only get the first result
small = soup.select_one('.overview small:nth-child(2)')
print(small.text.replace(',',''))
# all results
popularity = []
secondSmall = soup.select('.overview small:nth-child(2)')
for small in secondSmall:
    popularity.append(int(small.text.replace(',','')))
print(popularity)
If you just want the second small tag in the first div only, this will work:
soup = BeautifulSoup(r.content, 'html.parser')
overview = soup.findAll('div', class_='overview')
small_tag_2 = overview[0].findAll('small')[1]
print(small_tag_2)
If you want the second small tag in every overview div, iterate with a loop:
soup = BeautifulSoup(r.content, 'html.parser')
overview = soup.findAll('div', class_='overview')
for div in overview:
    small_tag_2 = div.findAll('small')[1]
    print(small_tag_2)
Note: I used html.parser instead of html5lib. If you are comfortable with html5lib, you can keep using it.
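The question also asked how to avoid a crash when the second small value is missing. The sketch below is one possible approach, not something taken from the answers: the helper name second_small_as_int and its default parameter are hypothetical. It also avoids the AttributeError from the question, which comes from looping over a single Tag: that loop yields the tag's children, including plain strings that have no findAll.

```python
from bs4 import BeautifulSoup

def second_small_as_int(html, default=0):
    """Return the second <small> of the first .overview div as an int.

    Hypothetical helper: falls back to `default` when the div, the tag,
    or a numeric value is missing.
    """
    soup = BeautifulSoup(html, "html.parser")
    first = soup.find("div", class_="overview")
    if first is None:
        return default
    smalls = first.find_all("small")
    if len(smalls) < 2:
        return default
    text = smalls[1].get_text(strip=True).replace(",", "")
    return int(text) if text.isdigit() else default

full = ('<div class="overview"><div class="overflow-h">'
        '<small>Value #1 here</small><small>131,390,555</small></div></div>')
missing = ('<div class="overview"><div class="overflow-h">'
           '<small>Value #1 here</small></div></div>')

print(second_small_as_int(full))     # 131390555
print(second_small_as_int(missing))  # 0
```

Inside the original loop, the result could be stored with raw[hashtag] = second_small_as_int(r.content), so a page without the value records a zero and the loop carries on.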

I want to get the value of multiple ids inside an a tag that resides in a div

Here's the HTML code:
<div class="sizeBlock">
<div class="size">
<a class="selectSize" id="44526" data-size-original="36.5">36.5</a>
</div>
<div class="size inactive active">
<a class="selectSize" id="44524" data-size-original="40">40</a>
</div>
<div class="size ">
<a class="selectSize" id="44525" data-size-original="40.5">40.5</a>
</div>
</div>
I want to get the values of the id and data-size-original attributes.
Here's my code:
for sizeBlock in soup.find_all('a', class_="selectSize"):
    aid = sizeBlock.get('id')
    size = sizeBlock.get('data-size-original')
The problem is that it also picks up the ids of other elements that have the same class "selectSize".
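I think this is what you want: restrict the search to anchors inside a div with class "size". Note that div.size also matches the div with class="size inactive active", because the selector only requires that "size" is one of the classes.
for sizeBlock in soup.select('div.size a.selectSize'):
    aid = sizeBlock.get('id')
    size = sizeBlock.get('data-size-original')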
This was already answered in "How to Beautiful Soup (bs4) match just one, and only one, css class".
Use soup.select. Here's a simple test:
from bs4 import BeautifulSoup
html_doc = """<div class="size">
<a class="selectSize otherclass" id="44526" data-ean="0193394075362" data-tprice="" data-sku="1171177-36.5" data-size-original="36.5">5</a>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
# soup.find_all('a', class_="selectSize") would include this anchor, because class_ matches any one of its classes
# the attribute selector below only matches when the class attribute is exactly "selectSize", so this anchor is skipped
for sizeBlock in soup.select("a[class='selectSize']"):
    aid = sizeBlock.get('id')
    size = sizeBlock.get('data-size-original')
    print(aid, size)
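If the goal is to keep only the anchors inside the size block and to skip the sizes marked inactive, a :not() selector is one option. This is a sketch under assumptions: the thread does not say that inactive sizes should be excluded, and the stray anchor with id "99999" is invented here to stand in for the "other ids" the question mentions. The :not() pseudo-class needs a Beautiful Soup version that uses soupsieve for select (4.7 or later).

```python
from bs4 import BeautifulSoup

# The sizeBlock markup is from the question; the last anchor is a made-up
# example of an unrelated element that shares the "selectSize" class.
html = """<div class="sizeBlock">
<div class="size"><a class="selectSize" id="44526" data-size-original="36.5">36.5</a></div>
<div class="size inactive active"><a class="selectSize" id="44524" data-size-original="40">40</a></div>
<div class="size "><a class="selectSize" id="44525" data-size-original="40.5">40.5</a></div>
</div>
<a class="selectSize" id="99999" data-size-original="0">elsewhere</a>"""

soup = BeautifulSoup(html, "html.parser")

# Anchors that are direct children of a non-inactive .size div inside .sizeBlock.
selector = "div.sizeBlock div.size:not(.inactive) > a.selectSize"
sizes = [(a.get("id"), a.get("data-size-original")) for a in soup.select(selector)]

print(sizes)  # [('44526', '36.5'), ('44525', '40.5')]
```

Dropping :not(.inactive) from the selector keeps all three sizes while still ignoring the anchor outside the block.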

Finding a variable values inside a div class with Beautiful soup

I'm trying to find longitudes and latitudes in this html code:
<div class="map-outer-wrap">
<div class="map-wrap" data-zoom="15" style="height:500px;" data-latitude="37.4418834" data-longitude="-122.14301949999998" data-style="color">
<div data-latitude="37.4418834" data-longitude="-122.14301949999998"></div>
</div>
View on Google Map
</div>
(The full page is here: https://www.towncity.com/property/whole-hotel-for-sale-in-riverside-area/)
Not exactly knowing where to start to actually get to data-latitude and data-longitude, I tried to narrow down my search to get to the closest div (map-wrap), but even this returns an empty list.
parser = LinkParser()
data, links = parser.getLinks("https://www.towncity.com/property/whole-hotel-for-sale-in-riverside-area/")
lnglat = BeautifulSoup(data, "lxml").findAll("div", {"class": "map-wrap"}).text
What's the proper way to retrieve the values of data-latitude and data-longitude in this page?
You can access a tag's attributes like key-value pairs.
Example:
s = """
<div class="map-outer-wrap">
<div class="map-wrap" data-zoom="15" style="height:500px;" data-latitude="37.4418834" data-longitude="-122.14301949999998" data-style="color">
<div data-latitude="37.4418834" data-longitude="-122.14301949999998"></div>
</div>
View on Google Map
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(s, "html.parser")
print( soup.find("div", class_="map-wrap")["data-latitude"] )
print( soup.find("div", class_="map-wrap")["data-longitude"] )
Output:
37.4418834
-122.14301949999998
Try this:
from bs4 import BeautifulSoup
import requests
s = requests.get("https://www.towncity.com/property/whole-hotel-for-sale-in-riverside-area/")
soup = BeautifulSoup(s.content, "lxml")
print(soup.find("div", class_="map-wrap")["data-latitude"])
print(soup.find("div", class_="map-wrap")["data-longitude"])
output:
37.4418834
-122.14301949999998
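Both answers index the tag directly, which raises an error if the div or an attribute is absent. A slightly more defensive sketch follows, assuming you want the coordinates as floats. The helper name read_coordinates is hypothetical, and the markup is a trimmed copy of the snippet from the question.

```python
from bs4 import BeautifulSoup

def read_coordinates(html):
    """Return (latitude, longitude) as floats, or None if they are not present.

    Hypothetical helper: the selector only matches a .map-wrap div that
    carries both data attributes.
    """
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one("div.map-wrap[data-latitude][data-longitude]")
    if node is None:
        return None
    return float(node["data-latitude"]), float(node["data-longitude"])

html = ('<div class="map-outer-wrap">'
        '<div class="map-wrap" data-zoom="15" data-latitude="37.4418834" '
        'data-longitude="-122.14301949999998" data-style="color"></div></div>')

coords = read_coordinates(html)
print(coords)
print(read_coordinates('<div class="map-wrap"></div>'))  # None
```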

Python: extract all the information(src, href, title) inside the class

I found that I can extract all the information I want from this HTML. I need to extract the title, href and src from it.
HTML:
<div class="col-md-2 col-sm-2 col-xs-2 home-hot-thumb">
<a itemprop="url" href="/slim?p=3090" class="main">
<img src="/FileUploads/Post/3090.jpg?w=70&h=70&mode=crop" alt="apple" title="apple" />
</a>
</div>
<div class="col-md-2 col-sm-2 col-xs-2 home-hot-thumb">
<a itemprop="url" href="/slim?p=3091" class="main">
<img src="/FileUploads/Post/3091.jpg?w=70&h=70&mode=crop" alt="banana" title="banana" />
</a>
</div>
Code:
import requests
from bs4 import BeautifulSoup
res = requests.get('http://www.cad.com/')
soup = BeautifulSoup(res.text,"lxml")
for a in soup.findAll('div', {"id":"home"}):
    for b in a.select(".main"):
        print("http://www.cad.com"+b.get('href'))
        print(b.get('title'))
I can successfully get the href this way, but since title and src are on another line (in the img tag), I don't know how to extract them. After this, I want to save them in Excel, so maybe I need to get this part working first and then do the export.
Expected output:
/slim?p=3090
apple
/FileUploads/Post/3090.jpg?w=70&h=70&mode=crop
/slim?p=3091
banana
/FileUploads/Post/3091.jpg?w=70&h=70&mode=crop
My own solution:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
base = 'http://www.cad.com/'
res = requests.get(base)
soup = BeautifulSoup(res.text, "lxml")
for a in soup.findAll('div', {"id": "home"}):
    # each thumbnail div holds one <a> with an <img> inside it
    for div in a.findAll('div', {"class": "home-hot-thumb"}):
        title = div.img.get('title')
        print(title)
        # urljoin avoids the double slash that plain concatenation produces
        href = urljoin(base, div.a.get('href'))
        print(href)
        src = urljoin(base, div.img.get('src'))
        print(src.replace('?w=70&h=70&mode=crop', ''))
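For the Excel step mentioned in the question, one simple route is to write a CSV file, which Excel can open. The sketch below runs on the HTML from the question instead of the live site. Writing to an in-memory buffer and the column names title, href and src are choices made here for illustration; nothing in the thread prescribes them.

```python
import csv
import io
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "http://www.cad.com/"

html = """<div class="col-md-2 col-sm-2 col-xs-2 home-hot-thumb">
<a itemprop="url" href="/slim?p=3090" class="main">
<img src="/FileUploads/Post/3090.jpg?w=70&h=70&mode=crop" alt="apple" title="apple" />
</a>
</div>
<div class="col-md-2 col-sm-2 col-xs-2 home-hot-thumb">
<a itemprop="url" href="/slim?p=3091" class="main">
<img src="/FileUploads/Post/3091.jpg?w=70&h=70&mode=crop" alt="banana" title="banana" />
</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for div in soup.find_all("div", class_="home-hot-thumb"):
    link, img = div.find("a"), div.find("img")
    if link is None or img is None:
        continue  # skip thumbnails that lack a link or an image
    src = img.get("src", "").split("?", 1)[0]  # drop the resize query string
    rows.append([img.get("title", ""), urljoin(BASE, link.get("href", "")), urljoin(BASE, src)])

# Write to an in-memory buffer; use open("out.csv", "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "href", "src"])
writer.writerows(rows)

print(buffer.getvalue())
```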

extracting data from div tags Python

I am trying to scrape data from a webpage that has some of the data nested in div tags.
url = 'http://london2012.fiba.com/pages/eng/fe/12/olym/p/gid/26/grid/A/rid/9087/sid/6233/game.html'
boxurl = urllib2.urlopen(url).read()
soup = BeautifulSoup(boxurl)
linescoreA = soup.find("div", {"class": "scoreA"})
print linescoreA
outputs this:
<div class="scoreA">
<div class="period">19</div>
<div class="period">22</div>
<div class="period">22</div><div class="period">26</div>
<div class="final">89</div>
<div class="clear"></div>
</div>
and that is where I get stuck. How do I get the data from the div tags?
To get just the textual data, use .stripped_strings:
print(list(linescoreA.stripped_strings))
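Try
# loop over the child divs only; looping over the tag itself also yields whitespace strings
for node in soup.find("div", {"class": "scoreA"}).findAll("div"):
    print(''.join(node.findAll(text=True)))
And what about
for node in soup.find("div", {"class": "scoreA"}).findAll("div"):
    print(node.string)
Sorry, I can't test this for you.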
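If the scores are needed as numbers and not as text, the period and final divs can be selected by class. This is a minimal sketch that runs on the markup printed in the question. It uses the bs4 package with the built-in html.parser, which is an assumption because the question does not say which BeautifulSoup version is installed.

```python
from bs4 import BeautifulSoup

# Markup copied from the output shown in the question.
html = """<div class="scoreA">
<div class="period">19</div>
<div class="period">22</div>
<div class="period">22</div><div class="period">26</div>
<div class="final">89</div>
<div class="clear"></div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
score = soup.find("div", class_="scoreA")

# One integer per period, in document order.
periods = [int(d.get_text(strip=True)) for d in score.find_all("div", class_="period")]
final = int(score.find("div", class_="final").get_text(strip=True))

print(periods, final)  # [19, 22, 22, 26] 89
```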
