This is my first time using BeautifulSoup.
I'm trying to scrape a value from a website with the following structure:
<div class="overview">
<i class="fa fa-instagram"></i>
<div class="overflow-h">
<small>Value #1 here</small>
<small>131,390,555</small>
<div class="progress progress-u progress-xxs">
<div style="width: 13%" aria-valuemax="100" aria-valuemin="0" aria-valuenow="92" role="progressbar" class="progress-bar progress-bar-u">
</div>
</div>
</div>
</div>
<div class="overview">
<i class="fa fa-facebook"></i>
<div class="overflow-h">
<small>Value #2 here</small>
<small>555</small>
<div class="progress progress-u progress-xxs">
<div style="width: 13%" aria-valuemax="100" aria-valuemin="0" aria-valuenow="92" role="progressbar" class="progress-bar progress-bar-u">
</div>
</div>
</div>
</div>
I want the second <small>131,390,555</small> in the first <div class="overview"></div>
This is the code I am trying to use:
# Get the hashtag popularity and add it to a dictionary
for hashtag in hashtags:
    popularity = []
    url = ('http://url.com/hashtag/'+hashtag)
    r = requests.get(url, headers=headers)
    if (r.status_code == 200):
        soup = BeautifulSoup(r.content, 'html5lib')
        overview = soup.findAll('div', attrs={"class":"overview"})
        print(overview)
        for small in overview:
            popularity.append(int(small.findAll('small')[1].text.replace(',','')))
        if popularity:
            raw[hashtag] = popularity[0]
            #print(popularity[0])
        print(raw)
        time.sleep(2)
    else:
        continue
The code works as long as the second <small>-value is populated in both div-overviews. I really only need the second small-value from the first overview-div.
I have tried to get it like this:
overview = soup.findAll('div', attrs={"class":"overview"})[0]
But I only get this error:
self.__class__.__name__, attr))
AttributeError: 'NavigableString' object has no attribute 'findAll'
Also, is there some way to not "break" the script if there is no small value at all? (Have the script just replace the empty value with a zero and continue.)
You can use an index, but I suggest using a CSS selector with nth-child():
soup = BeautifulSoup(html, 'html.parser')
# only get first result
small = soup.select_one('.overview small:nth-child(2)')
print(small.text.replace(',',''))
# all results
secondSmall = soup.select('.overview small:nth-child(2)')
for small in secondSmall:
    popularity.append(int(small.text.replace(',','')))
print(popularity)
If you just want the 2nd small tag in the 1st div only, this will work:
soup = BeautifulSoup(r.content, 'html.parser')
overview = soup.findAll('div', class_ = 'overview')
small_tag_2 = overview[0].findAll('small')[1]
print(small_tag_2)
If you want the 2nd small tag in every overview div, iterate using the loop:
soup = BeautifulSoup(r.content, 'html.parser')
overview = soup.findAll('div', class_ = 'overview')
for div in overview:
    small_tag_2 = div.findAll('small')[1]
    print(small_tag_2)
Note: I used html.parser instead of html5lib. If you know how to work with html5lib, then it's your choice.
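For the case where the second small value is missing, one option is to check how many small tags were found before indexing, and fall back to zero. The sketch below is a minimal illustration on a trimmed copy of the HTML from the question (the second block deliberately lacks its second value). The helper name second_small_value and its default parameter are made up for this example and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the structure from the question; the second block
# deliberately has no second <small> value.
html = """
<div class="overview">
  <div class="overflow-h">
    <small>Value #1 here</small>
    <small>131,390,555</small>
  </div>
</div>
<div class="overview">
  <div class="overflow-h">
    <small>Value #2 here</small>
  </div>
</div>
"""


def second_small_value(overview_div, default=0):
    """Return the second <small> of one overview div as an int, or default."""
    if overview_div is None:
        return default
    smalls = overview_div.find_all('small')
    if len(smalls) < 2:
        return default
    text = smalls[1].get_text(strip=True).replace(',', '')
    try:
        return int(text)
    except ValueError:
        # Empty or non-numeric text
        return default


soup = BeautifulSoup(html, 'html.parser')
overviews = soup.find_all('div', class_='overview')

# Only the first overview div, as asked in the question
first_value = second_small_value(overviews[0] if overviews else None)
# Every overview div, with 0 where the value is missing
all_values = [second_small_value(div) for div in overviews]
print(first_value)
print(all_values)
```

Because the helper never indexes past the end of the list, a page without the value yields the default instead of raising an IndexError.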
I'm stuck getting all the data within a span tag. My code gives me only the value of the first a tag in each span and ignores the others. In my example (NB: I reduced the span contents here, but there is a lot more inside):
<span class="block-niveaux-sponsors">
<a href="http://www.keolis.com/" id="a47-logo-part-keolis" target="_blank">
<img src="images/visuels_footer/footer/part_keolis.201910210940.jpg"/>
</a>
<div class="clearfix"></div>
</span>
<span class="block-niveaux-sponsors">
<a href="http://www.cg47.fr/" id="a47-logo-part-cg47" target="_blank">
<img src="images/visuels_footer/footer/part_cg47.201910210940.jpg"/>
</a>
<div class="clearfix"></div>
</span>
<span class="block-niveaux-sponsors">
<a href="http://www.errea.it/fr/" id="a47-logo-part-errea" target="_blank">
<img src="images/visuels_footer/footer/part_errea.201910210940.jpg"/>
</a>
<div class="clearfix"></div>
</span>
My code is:
page = urlopen(lien_suagen)
soup = bs(page, 'html.parser')
title_box_agen = soup.find_all('div', attrs={'id':'autres'})
for tag in title_box_agen:
    for each_row in tag.find_all('span'):
        links = each_row.find('a', href=True)
        title = links.get('id')
        print(title)
This gives me only the first id value in each span.
I want all the ids.
You should try:
page = urlopen(lien_suagen)
soup = bs(page, 'html.parser')
title_box_agen = soup.find_all('div', attrs={'id':'autres'})
for tag in title_box_agen:
    for each_row in tag.find_all('span'):
        links = each_row.find_all('a', href=True)
        for link in links:
            title = link.get('id')
            print(title)
You can get all the link ids for each span of the block-niveaux-sponsors class like this.
(not tested)
page = urlopen(lien_suagen)
soup = bs(page, 'html.parser')
spans_niveux = soup.find_all('span', class_='block-niveaux-sponsors')
for span in spans_niveux:
    for link in span.find_all('a', href=True):
        link_id = link.get('id')  # link.id would look for a child <id> tag and return None
        print(link_id)
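To check the idea without fetching the page, here is a minimal sketch that runs on a shortened stand-in for the HTML from the question. It assumes the spans sit inside the div with id "autres" that the asker's code searches, and that one span may hold several links; it uses a single CSS selector instead of nested loops. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page; the first span holds two links
# to show that every <a> is collected, not only the first one.
html = """
<div id="autres">
<span class="block-niveaux-sponsors">
<a href="http://www.keolis.com/" id="a47-logo-part-keolis" target="_blank"></a>
<a href="http://www.cg47.fr/" id="a47-logo-part-cg47" target="_blank"></a>
</span>
<span class="block-niveaux-sponsors">
<a href="http://www.errea.it/fr/" id="a47-logo-part-errea" target="_blank"></a>
</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# a[id] keeps only anchors that actually carry an id attribute
links = soup.select('#autres span.block-niveaux-sponsors a[id]')
ids = [link['id'] for link in links]
print(ids)
```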
Here's the HTML code:
<div class="sizeBlock">
<div class="size">
<a class="selectSize" id="44526" data-size-original="36.5">36.5</a>
</div>
<div class="size inactive active">
<a class="selectSize" id="44524" data-size-original="40">40</a>
</div>
<div class="size ">
<a class="selectSize" id="44525" data-size-original="40.5">40.5</a>
</div>
</div>
I want to get the values of the id attribute and of data-size-original.
Here's my code:
for sizeBlock in soup.find_all('a', class_="selectSize"):
    aid = sizeBlock.get('id')
    size = sizeBlock.get('data-size-us')
The problem is that it gets the values of other ids that have the same class "selectSize".
I think this is what you want. It only matches selectSize anchors that sit inside a div with the class size. Note that div.size also matches div class='size inactive active', so anchors in those blocks are still returned.
for sizeBlock in soup.select('div.size a.selectSize'):
    aid = sizeBlock.get('id')
    size = sizeBlock.get('data-size-us')
Already answered here: How to Beautiful Soup (bs4) match just one, and only one, css class
Use soup.select. Here's a simple test:
from bs4 import BeautifulSoup
html_doc = """<div class="size">
<a class="selectSize otherclass" id="44526" data-ean="0193394075362" data-tprice="" data-sku="1171177-36.5" data-size-original="36.5">5</a>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
# for sizeBlock in soup.find_all('a', class_="selectSize"):  # this would include the anchor
for sizeBlock in soup.select("a[class='selectSize']"):  # exact match, so this anchor is skipped
    aid = sizeBlock.get('id')
    size = sizeBlock.get('data-size-original')
    print(aid, size)
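If the goal is to skip the anchors inside the "size inactive active" blocks, a :not() selector is one way to express that. This is a minimal sketch on the HTML from the question, assuming a bs4 version whose select() is backed by soupsieve (4.7 or later); the sizes list is just an illustrative way to collect the results.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="sizeBlock">
  <div class="size">
    <a class="selectSize" id="44526" data-size-original="36.5">36.5</a>
  </div>
  <div class="size inactive active">
    <a class="selectSize" id="44524" data-size-original="40">40</a>
  </div>
  <div class="size ">
    <a class="selectSize" id="44525" data-size-original="40.5">40.5</a>
  </div>
</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

sizes = []
# Stay inside the sizeBlock container and drop divs that also carry "inactive"
for anchor in soup.select('div.sizeBlock div.size:not(.inactive) a.selectSize'):
    sizes.append((anchor.get('id'), anchor.get('data-size-original')))
print(sizes)
```

Scoping the selector to div.sizeBlock also keeps out selectSize anchors that live elsewhere on the page, which is the problem described in the question.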
I am trying to extract multiple factors from each of the repeated tags in a HTML file.
....
<div class="title">
<a target="_blank" id="jl_fe575975c912af9e" href="https://www.indeed.com/company/Nestvestor/jobs/Data-Science-Intern-fe575975c912af9e?fccid=8eed076a625928e7&vjs=3" onmousedown="return rclk(this,jobmap[0],0);" onclick=" setRefineByCookie(['radius']); return rclk(this,jobmap[0],true,0);" rel="noopener nofollow" title="Data Science Intern" class="jobtitle turnstileLink " data-tn-element="jobTitle">
Data Science Intern</a>
</div>
<div class="sjcl">
<div>
<span class="company">
Nestvestor</span>
</div>
<div class="jobsearch-SerpJobCard unifiedRow row result clickcard" id="p_9cfaca3374641aa0" data-jk="9cfaca3374641aa0" data-tn-component="organicJob">
<div class="title">
<a target="_blank" id="jl_9cfaca3374641aa0" href="https://www.indeed.com/rc/clk?jk=9cfaca3374641aa0&fccid=1779658d5b4ae2b0&vjs=3" onmousedown="return rclk(this,jobmap[1],0);" onclick=" setRefineByCookie(['radius']); return rclk(this,jobmap[1],true,0);" rel="noopener nofollow" title="Product Manager" class="jobtitle turnstileLink " data-tn-element="jobTitle">
Product Manager</a>
</div>
<div class="sjcl">
<div>
<span class="company">
<a data-tn-element="companyName" class="turnstileLink" target="_blank" href="https://www.indeed.com/cmp/Sojern" onmousedown="this.href = appendParamsOnce(this.href, 'from=SERP&campaignid=serp-linkcompanyname&fromjk=9cfaca3374641aa0&jcid=1779658d5b4ae2b0')" rel="noopener">
Sojern</a></span>
...
soup = BeautifulSoup(open(input("Enter a file to read: ")), "html.parser")
title = soup.find_all('div', class_='title')
for span in title:
    print(span.text)
company = soup.find_all('span', class_='company')
for span in company:
    print(span.text)
So far I have figured out how to get the following result:
Job_Title1
Job_Title2
Job_Title3
And in a different code result:
Company_name1
Company_Name2
Company_Name3
How do I get the results to look like this with one run of the code:
Job_Title1,Company_Name1,
Job_Title2,Company_Name2,
Job_Title3,Company_Name3,
Welcome to Stack Overflow. Just use this:
company = soup.find_all('span', class_='company')
title = soup.find_all('div', class_='title')
for t, c in zip(title, company):
    print("Job_Title :%s Company_Name :%s" % (t.text, c.text))
From what you have it looks like you need to nest your loops. Without the website, it is hard to tell but I would try something like this.
company = soup.find_all('span', class_='company')
title = soup.find_all('div', class_='title')
for span in title:
    for x in company:
        print(x.text,span.text)
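Note that nested loops print every title together with every company, and zip() pairs items purely by position, which only stays aligned if each job has exactly one company span. A third option, sketched below under the assumption (taken from the excerpt) that each title div is followed by its own company span, is to start from each title and take the next company span with find_next(). The HTML is a shortened stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the two job entries shown in the question
html = """
<div class="title">
<a class="jobtitle turnstileLink " title="Data Science Intern">
Data Science Intern</a>
</div>
<div class="sjcl">
<div>
<span class="company">
Nestvestor</span>
</div>
</div>
<div class="title">
<a class="jobtitle turnstileLink " title="Product Manager">
Product Manager</a>
</div>
<div class="sjcl">
<div>
<span class="company">
<a class="turnstileLink" href="https://www.indeed.com/cmp/Sojern">
Sojern</a></span>
</div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

rows = []
for title_div in soup.find_all('div', class_='title'):
    # First company span that appears after this title in the document
    company_span = title_div.find_next('span', class_='company')
    company = company_span.get_text(strip=True) if company_span else ''
    rows.append((title_div.get_text(strip=True), company))

for title, company in rows:
    print('%s,%s,' % (title, company))
```

If a job can lack a company span, find_next() would pick up the next job's company, so searching inside the enclosing job card element is the safer choice in that case.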
I'm trying to extract the tags with a given class from an HTML file, but only those located before a given stopping point. What I have is:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
This works, but it finds all instances of myclass, and I only want those before the following text shows up in the soup:
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
The thing that makes this block unique are the Title text N lines, especially the Title text N2. line. There are many cat-title tags before, so I can't use that as a stopping condition.
The code surrounding this block looks like this:
...
<div class="myc">
<a class="bbb" href="linkhere_893">
<span class="myclass">Text893</span>
<img data-lazy="https://link893.jpg"/>
</a>
</div>
<div class="myc">
<a class="bbb" href="linkhere_96">
<span class="myclass">Text96</span>
<img data-lazy="https://link96.jpg"/>
</a>
</div>
</div><!-- This closes a list that starts above -->
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div class="list" id="55">
<div class="myc">
<a class="bbb" href="linkhere_34">
<span class="myclass">Text34</span>
<img data-lazy="https://link34.jpg"/>
</a>
</div>
<div class="myc">
...
continuing both above and below.
How can I do this?
Try using find_all_previous():
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
stop_at = soup.find("h4", class_="cat-title", id='55') # finds your stop tag
class_extr = stop_at.find_all_previous("span", class_="myclass")
This will stop at the first <h4 class='cat-title', id=55> tag in the event that there are multiple.
Reference: Beautiful Soup Documentation
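One detail worth knowing: find_all_previous() walks backwards from the stop tag, so the matches come back in reverse document order. The sketch below is a minimal illustration on a small stand-in page; it also picks the stop tag by the text of its small child, as the question describes. The helper is_stop_heading is a name made up for this example.

```python
from bs4 import BeautifulSoup

page = """
<html><body><p>
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
</p>
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
<p>
<span class="myclass">text 3</span>
</p>
</body></html>
"""

soup = BeautifulSoup(page, 'html.parser')


def is_stop_heading(tag):
    """True for the cat-title h4 whose <small> holds the unique text."""
    if tag.name != 'h4' or 'cat-title' not in tag.get('class', []):
        return False
    small = tag.find('small')
    return small is not None and small.get_text(strip=True) == 'Title text N2.'


stop_at = soup.find(is_stop_heading)
texts = []
if stop_at is not None:
    before = stop_at.find_all_previous('span', class_='myclass')
    # Restore document order: the nearest match is returned first
    texts = [span.get_text() for span in reversed(before)]
print(texts)
```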
How about this:
page = requests.get("https://mysite")
# Split the page text at the unwanted string, then parse only the part before it
text = page.text.split('Title text N2.')
soup = BeautifulSoup(text[0], 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
You can try something like this:
from bs4 import BeautifulSoup
page = """
<html><body><p>
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
</p>
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
<p>
<span class="myclass">text 3</span>
<span class="myclass">text 4</span>
</p>
</body>
</html>
"""
soup = BeautifulSoup(page, 'html.parser')
for i in soup.find_all():
    if i.name == 'h4' and i.has_attr('class') and i['class'][0] == 'cat-title' and i.has_attr('id') and i['id'] == '55':
        if i.find("small") and i.find("small").text.strip() == "Title text N2.":
            break
    elif i.name == 'span' and i.has_attr('class') and i['class'][0] == 'myclass':
        print(i)
Outputs:
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
I have written a Python script to scrape data from http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings
It is a list of 100 players, and I successfully scraped this data. The problem is that when I run the script, instead of scraping the data just once, it scrapes the same data 3 times.
<div class="cb-col cb-col-100 cb-font-14 cb-lst-itm text-center">
<div class="cb-col cb-col-16 cb-rank-tbl cb-font-16">1</div>
<div class="cb-col cb-col-50 cb-lst-itm-sm text-left">
<div class="cb-col cb-col-33">
<div class="cb-col cb-col-50">
<span class=" cb-ico" style="position:absolute;"></span> –
</div>
<div class="cb-col cb-col-50">
<img src="http://i.cricketcb.com/i/stats/fw/50x50/img/faceImages/2250.jpg" class="img-responsive cb-rank-plyr-img">
</div>
</div>
<div class="cb-col cb-col-67 cb-rank-plyr">
<a class="text-hvr-underline text-bold cb-font-16" href="/profiles/2250/steven-smith" title="Steven Smith's Profile">Steven Smith</a>
<div class="cb-font-12 text-gray">AUSTRALIA</div>
</div>
</div>
<div class="cb-col cb-col-17 cb-rank-tbl">906</div>
<div class="cb-col cb-col-17 cb-rank-tbl">1</div>
</div>
And here is the Python script I wrote to scrape each player's data.
import sys,requests,csv,io
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url = "http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings"
r = requests.get(url)
r.content
soup = BeautifulSoup(r.content, "html.parser")
maindiv = soup.find_all("div", {"class": "text-center"})
for div in maindiv:
    print(div.text)
but instead of scraping the data once, it scrapes the same data 3 times.
Where can I make changes to get data just one time?
Select the table and look for the divs in that:
maindiv = soup.select("#batsmen-tests div.text-center")
for div in maindiv:
    print(div.text)
Your original output and the one above get all the text from each div as a single line, which is not really useful. If you just want the player names:
anchors = soup.select("#batsmen-tests div.cb-rank-plyr a")
for a in anchors:
    print(a.text)
A quick and easy way to get the data in a nice csv format is to just get text from each child:
maindiv = soup.select("#batsmen-tests div.text-center")
for d in maindiv[1:]:
    row_data = u",".join(s.strip() for s in filter(None, (t.find(text=True, recursive=False) for t in d.find_all())))
    if row_data:
        print(row_data)
Now you get output like:
# rank, up/down, name, country, rating, best rank
1,–,Steven Smith,AUSTRALIA,906,1
2,–,Joe Root,ENGLAND,878,1
3,–,Kane Williamson,NEW ZEALAND,876,1
4,–,Hashim Amla,SOUTH AFRICA,847,1
5,–,Younis Khan,PAKISTAN,845,1
6,–,Adam Voges,AUSTRALIA,802,5
7,–,AB de Villiers,SOUTH AFRICA,802,1
8,–,Ajinkya Rahane,INDIA,785,8
9,2,David Warner,AUSTRALIA,772,3
10,–,Alastair Cook,ENGLAND,770,2
11,1,Misbah-ul-Haq,PAKISTAN,764,6
As opposed to:
PositionPlayerRatingBest Rank
Player
1 –Steven SmithAUSTRALIA9061
2 –Joe RootENGLAND8781
3 –Kane WilliamsonNEW ZEALAND8761
4 –Hashim AmlaSOUTH AFRICA8471
5 –Younis KhanPAKISTAN8451
6 –Adam VogesAUSTRALIA8025
The reason you get the output three times is that the website has three categories. You have to select one of them and then search within it.
The simplest way of doing that with your code is to add just one line:
import sys,requests,csv,io
from bs4 import BeautifulSoup
url = "http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings"
r = requests.get(url)
r.content
soup = BeautifulSoup(r.content, "html.parser")
specific_div = soup.find_all("div", {"id": "batsmen-tests"})
maindiv = specific_div[0].find_all("div", {"class": "text-center"})
for div in maindiv:
    print(div.text)
This will give similar results with just the Test batsmen; for the other categories, just change the "id" in the specific_div line.
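The effect of scoping the search to one container can be reproduced offline. In this sketch the HTML is a tiny stand-in for the page: only the id batsmen-tests comes from the answers above, while batsmen-odis and batsmen-t20s are hypothetical ids used purely for illustration, and the row texts are placeholders.

```python
from bs4 import BeautifulSoup

# Stand-in page with three category containers that share the same row class.
# Only "batsmen-tests" is taken from the answers; the other ids are hypothetical.
html = """
<div id="batsmen-tests"><div class="text-center">tests row</div></div>
<div id="batsmen-odis"><div class="text-center">odis row</div></div>
<div id="batsmen-t20s"><div class="text-center">t20s row</div></div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Unscoped search: one hit per category, which is why rows repeat
everything = [d.get_text() for d in soup.find_all('div', class_='text-center')]

# Scoped search: only rows inside the chosen container
tests_only = [d.get_text() for d in soup.select('#batsmen-tests div.text-center')]

print(len(everything), tests_only)
```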