How to get attribute data using Python Beautiful Soup

Hi, I am trying to use the Python Beautiful Soup web crawler to get data from IMDb. I have followed the documentation online and am able to retrieve all the data using this code:
from requests import get
from bs4 import BeautifulSoup
url = 'https://www.imdb.com/title/tt1405406/episodes?season=1&ref_=tt_eps_sn_1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_ = 'image')
print(movie_containers)
With the above code I am able to retrieve a list of all the data in the divs with the class image, as shown below:
<div class="image">
<a href="/title/tt1486497/" itemprop="url" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
<img alt="Pilot" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BNTExMDIwNTUyNF5BMl5BanBnXkFtZTcwNzU5MDg1Mg##._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="image">
<a href="/title/tt1485650/" itemprop="url" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
<img alt="The Night of the Comet" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BMjIyNDczNDYzNV5BMl5BanBnXkFtZTcwNDk1MDQ4Mg##._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep2</div>
</div>
</a> </div>
But I am trying to get the values of the data-const attribute from this result. I want to display just the values of the data-const attribute instead of the whole HTML result. Expected result: tt1486497, tt1485650

Instead, use the class names of the inner div that carries the data-const attribute.
from bs4 import BeautifulSoup
html = """<div class="image">
<a href="/title/tt1486497/" itemprop="url" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
<img alt="Pilot" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BNTExMDIwNTUyNF5BMl5BanBnXkFtZTcwNzU5MDg1Mg##._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="image">
<a href="/title/tt1485650/" itemprop="url" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
<img alt="The Night of the Comet" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BMjIyNDczNDYzNV5BMl5BanBnXkFtZTcwNDk1MDQ4Mg##._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep2</div>
</div>
</a> </div>"""
soup = BeautifulSoup(html, "lxml")
for div in soup.find_all("div", attrs={"class":"hover-over-image zero-z-index"}):
    print(div["data-const"])
Output:
tt1486497
tt1485650
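A variation on the loop above, as a minimal sketch: a CSS attribute selector picks every div that carries a data-const attribute, so the class names do not have to match exactly. The sample HTML is trimmed from the question and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed sample of the markup from the question (illustrative only).
html = """
<div class="image">
  <a href="/title/tt1486497/" title="Pilot">
    <div class="hover-over-image zero-z-index" data-const="tt1486497"><div>S1, Ep1</div></div>
  </a>
</div>
<div class="image">
  <a href="/title/tt1485650/" title="The Night of the Comet">
    <div class="hover-over-image zero-z-index" data-const="tt1485650"><div>S1, Ep2</div></div>
  </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div[data-const]" matches any div that has the attribute, whatever its classes are.
data_consts = [div["data-const"] for div in soup.select("div.image div[data-const]")]
print(", ".join(data_consts))
```

Using the attribute selector keeps the code working if the site renames the helper classes but keeps the data-const attribute.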

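Try something along the lines of the following (select() is called on the soup object, because the list returned by find_all() has no select() method):
for dc in html_soup.select('div.image div.hover-over-image'):
    print(dc['data-const'])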
output:
tt1486497
tt1485650

I recommend using requests-html. It's more intuitive than just using beautiful soup.
Example:
from requests_html import HTMLSession
url = 'https://www.imdb.com/title/tt1405406/episodes?season=1&ref_=tt_eps_sn_1'
session = HTMLSession()
response = session.get(url)
html = response.html
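# requests-html uses find() with a CSS selector; data-const sits on the inner div, not on the <a>
imageContainers = html.find("div.image")
dataConsts = [c.find("div.hover-over-image", first=True).attrs["data-const"] for c in imageContainers]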
This should exactly do what you need, but I couldn't test it
Good luck!

Related

bs4 - how to use find or find_all to get specific content from an url

I need to get the individual URL for each country from the "a href=" under each "div" with the class "well span4". For example, I need to get https://www.rulac.org/browse/countries/myanmar and https://www.rulac.org/browse/countries/the-netherlands and every other URL after "a href=" (as shown in the partial HTML structure below).
Since the first "a href=" has no class, how do I search for it and get all the country URLs?
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
url = "https://www.rulac.org/browse/countries/P36"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.find_all("div", class_="well span4")
# Partial HTML structure shown below
[<div class="well span4">
<a href="https://www.rulac.org/browse/countries/myanmar">
<div class="map-wrap">
<img alt="Myanmar" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5&center=19.7633057,96.07851040000003&format=png&style=feature:administrative.locality%7Celement:all%7Cvisibility:off&style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&style=feature:road%7Celement:all%7Cvisibility:off&style=feature:landscape%7Celement:all%7Chue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Myanmar"/>
<img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Myanmar</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/myanmar">Read on <i class="icon-caret-right"></i></a>
</div>,
<div class="well span4">
<a href="https://www.rulac.org/browse/countries/the-netherlands">
<div class="map-wrap">
<img alt="Netherlands" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5&center=52.203566364441,5.7275408506393&format=png&style=feature:administrative.locality%7Celement:all%7Cvisibility:off&style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&style=feature:road%7Celement:all%7Cvisibility:off&style=feature:landscape%7Celement:all%7Chue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Netherlands"/>
<img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Netherlands</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/the-netherlands">Read on <i class="icon-caret-right"></i></a>
</div>,
<div class="well span4">
<a href="https://www.rulac.org/browse/countries/niger">
<div class="map-wrap">
<img alt="Niger" src="https://maps.googleapis.com/maps/api/staticmap?size=700x700&zoom=5&center=13.5115963,2.1253854000000274&format=png&style=feature:administrative.locality%7Celement:all%7Cvisibility:off&style=feature:water%7Celement:all%7Chue:0xEDF9FF%7Clightness:80%7Csaturation:9&style=feature:road%7Celement:all%7Cvisibility:off&style=feature:landscape%7Celement:all%7Chue:0xE0EADC&key=AIzaSyBN1vexCTXoQaavAWZULZTwnIAWoYtAvwU" title="Niger"/>
<img class="marker" src="https://www.rulac.org/assets/images/templates/marker-country.png"/>
</div>
</a>
<h2>Niger</h2>
<a class="btn" href="https://www.rulac.org/browse/countries/niger">Read on <i class="icon-caret-right"></i></a>
</div>,
You can use soup.select() with a CSS selector to get all <a> elements of class btn that are children of <div>s with classes well and span4. Like this:
import requests
from bs4 import BeautifulSoup
url = "https://www.rulac.org/browse/countries/P36"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.select("div.well.span4 > a.btn")
# get all hrefs in a list and print it
hrefs = [el['href'] for el in res]
for href in hrefs:
    print(href)
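Each card in the HTML above holds two links to the same country page. As an alternative sketch, run on a trimmed copy of that HTML with illustrative variable names, the following takes only the first direct child <a> of each div, so the class of the link does not matter and each URL appears once.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the structure from the question (illustrative only).
html = """
<div class="well span4">
  <a href="https://www.rulac.org/browse/countries/myanmar"><div class="map-wrap"></div></a>
  <h2>Myanmar</h2>
  <a class="btn" href="https://www.rulac.org/browse/countries/myanmar">Read on</a>
</div>
<div class="well span4">
  <a href="https://www.rulac.org/browse/countries/niger"><div class="map-wrap"></div></a>
  <h2>Niger</h2>
  <a class="btn" href="https://www.rulac.org/browse/countries/niger">Read on</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

country_urls = []
for card in soup.find_all("div", class_="well span4"):
    # recursive=False restricts the search to direct children of the card;
    # href=True skips any <a> that has no href attribute.
    link = card.find("a", href=True, recursive=False)
    if link is not None:
        country_urls.append(link["href"])

print(country_urls)
```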

Missing parts in Beautiful Soup results

I'm trying to retrieve the table in the ul tag in the following html code, which mostly looks like this:
<ul class='list' id='js_list'>
<li class="first">
<div class="meta">
<div class="avatar">...</div>
<div class="name">黑崎一护</div>
<div class="type">...</div>
</div>
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>
but just with more entries. It's from this website.
So far I have this (for specifically getting the win rates):
from bs4 import BeautifulSoup
import requests
r = requests.get("https://moba.163.com/m/wx/ss/")
soup = BeautifulSoup(r.content, 'html5lib')
win_rates = soup.find_all('div', class_ = "winrate")
But this returns empty and it seems like the farthest Beautiful Soup was able to get was the ul tag, but none of the information under it. Is this a parsing issue? Or is there JavaScript source code that I'm missing?
I think your issue is that your format is incorrect for pulling the div with the attribute. I was able to pull the winrate div with this:
soup.find('div',attrs={'class':'winrate'})
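For reference, class_="winrate" and attrs={'class': 'winrate'} are two spellings of the same filter in Beautiful Soup, so they select the same elements. The sketch below uses two made-up HTML strings (assumptions, not the real page): one with the list items present, and one where the ul is empty, as it would be if the rows were filled in later by JavaScript. An empty result on the downloaded HTML would point to the second case.

```python
from bs4 import BeautifulSoup

# Hypothetical samples: the first mimics the rendered page, the second mimics
# raw HTML whose <ul> is populated later by JavaScript.
rendered = """<ul class="list" id="js_list"><li class="first">
<div class="rates"><div class="winrate">56.11%</div><div class="pickrate">7.44%</div></div>
</li></ul>"""
raw = '<ul class="list" id="js_list"></ul>'


def win_rates(html):
    """Return the text of every div with class "winrate" in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    by_keyword = soup.find_all("div", class_="winrate")
    by_attrs = soup.find_all("div", attrs={"class": "winrate"})
    # Both spellings select the same tags.
    assert by_keyword == by_attrs
    return [div.get_text(strip=True) for div in by_keyword]


print(win_rates(rendered))
print(win_rates(raw))
```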

How to get all hrefs(inside <a tag) and assign them to a variable??

I need all the hrefs present in 'a' tags, assigned to a variable.
I did this, but only got the first link:
soup_level1 = BeautifulSoup(driver.page_source, 'lxml')
userName = soup_level1.find(class_='_32mo')
link1 = (userName.get('href'))
And the output I get is
print(link1)
https://www.facebook.com/xxxxxx?ref=br_rs
But I need at least the top 3 or top 5 links.
The structure of the web page is
<div>
<a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs">
</div>
I need those hrefs.
from bs4 import BeautifulSoup
html="""
<div>
<a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs">
</div>
"""
soup=BeautifulSoup(html,'lxml')
# find_all returns every matching tag, whereas find returns only the first one
my_links = soup.find_all("a", {"class": "_32mo"})
for link in my_links:
    print(link.get('href'))
Output
https://www.facebook.com/xxxxx?ref=br_rs
https://www.facebook.com/yyyyy?ref=br_rs
https://www.facebook.com/zzzzz?ref=br_rs
To get top n links you can use
max_num_of_links=2
for link in my_links[:max_num_of_links]:
    print(link.get('href'))
Output
https://www.facebook.com/xxxxx?ref=br_rs
https://www.facebook.com/yyyyy?ref=br_rs
You can also save the top n links to a list
link_list=[]
max_num_of_links=2
for link in my_links[:max_num_of_links]:
    link_list.append(link.get('href'))
print(link_list)
Output
['https://www.facebook.com/xxxxx?ref=br_rs', 'https://www.facebook.com/yyyyy?ref=br_rs']
EDIT:
If you need the driver to visit the links one by one:
max_num_of_links=3
for link in my_links[:max_num_of_links]:
    driver.get(link.get('href'))
    # rest of your code ...
If for some reason you want them in different variables like link1, link2, etc.:
from bs4 import BeautifulSoup
html="""
<div>
<a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs">
</div>
"""
soup=BeautifulSoup(html,'lxml')
my_links = soup.find_all("a", {"class": "_32mo"})
link1=my_links[0].get('href')
link2=my_links[1].get('href')
link3=my_links[2].get('href')
# and so on, but be careful: accessing an index that does not exist in my_links raises an IndexError
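To sidestep the IndexError risk, a small helper can return "up to n" links instead of indexing fixed positions. This is only a sketch: the function name first_hrefs and its parameters are made up for illustration, and the HTML is the sample from above with the anchors closed.

```python
from bs4 import BeautifulSoup

html = """
<div><a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs"></a></div>
<div><a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs"></a></div>
<div><a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs"></a></div>
"""


def first_hrefs(markup, css_class, limit):
    """Return up to `limit` href values from <a> tags with the given class."""
    soup = BeautifulSoup(markup, "html.parser")
    # limit= stops the search early; href=True skips anchors without an href.
    links = soup.find_all("a", class_=css_class, href=True, limit=limit)
    return [a["href"] for a in links]


top_two = first_hrefs(html, "_32mo", 2)
top_five = first_hrefs(html, "_32mo", 5)  # only three exist, so three come back
print(top_two)
print(len(top_five))
```

Asking for five links when only three exist simply returns three, so no index check is needed.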

Unable to parse div tag using beautiful soup in python?

I am learning to use Beautiful Soup to parse div containers from HTML. But for some reason, when I pass the class name of the div containers to Beautiful Soup, nothing happens: I get no elements when I try to parse the div. What could I be doing wrong? Here is my HTML and the parser:
<div class="upcoming-date ng-hide no-league" ng-show="nav.upcoming" ng-class="{'no-league': !search.checkShowTitle(nav.sport,nav.todayHighlights,nav.upcoming,nav.orderBy,"FOOTBALL - HIGHLIGHTS")}">
<span class="weekday">Monday</span>
<timecomponent datetime="'2018-07-09T20:00:00+03:00'" show-date="true" show-time="false" class="date ng-isolate-scope"><span class="ng-binding">09/07/18</span></timecomponent>
<div class="clear"></div>
</div>
<div id="g1390856" class="match football FOOTBALL - HIGHLIGHTS" itemscope="" itemtype="https://schema.org/SportsEvent">
<div class="leaguename ng-hide" ng-show="search.checkShowTitle(nav.sport,nav.todayHighlights,nav.upcoming,nav.orderBy,"FOOTBALL - HIGHLIGHTS") && (1 || (nav.upcoming && 0))">
<span class="name">
<span class="flag-icon flag-icon-swe"></span>
Sweden - Allsvenskan
</span>
</div>
<ul class="meta">
<li class="date">
<timecomponent datetime="'2018-07-09T20:00:00+03:00'" show-date="true" show-time="false" class="ng-isolate-scope"><span class="ng-binding">09/07/18</span></timecomponent>
</li>
<li class="time">
<timecomponent datetime="'2018-07-09T20:00:00+03:00'" show-date="false" show-time="true" class="ng-isolate-scope"><span class="ng-binding">20:00</span></timecomponent>
</li>
<li class="game-id"><span class="gameid">GameID:</span> 2087</li>
</ul>
<ul class="teams">
<li>Hammarby</li>
<li>Ostersunds</li>
</ul>
<ul class="bet-selector">
<li class="pick01" id="b499795664">
<a data-id="499795664" ng-click="bets.pick($event, 499795664, 2087, 2.10)" class="betting-button pick-button " title="Hammarby">
<span class="team">Hammarby</span>
<span class="odd">2.10</span>
</a>
</li> <li class="pick0X" id="b499795666">
<a data-id="499795666" ng-click="bets.pick($event, 499795666, 2087, 3.56)" class="betting-button pick-button " title="Draw">
<span class="team">Draw</span>
<span class="odd">3.56</span>
</a>
</li> <li class="pick02" id="b499795668">
<a data-id="499795668" ng-click="bets.pick($event, 499795668, 2087, 3.40)" class="betting-button pick-button " title="Ostersunds">
<span class="team">Ostersunds</span>
<span class="odd">3.40</span>
</a>
</li> </ul>
<ul class="extra-picks">
<li>
<a class="betting-button " href="/games/1390856/markets?league=0&top=0&sid=2087&sportId=1">
<span class="short-desc">+13</span>
<span class="long-desc">View 13 more markets</span>
</a>
</li>
</ul>
<div class="game-stats">
<img class="img-responsive" src="/img/chart-icon.png?v2.2.25.2">
</div>
<div class="clear"></div>
</div>
.............................................................
parser.py
import requests
import urllib2
from bs4 import BeautifulSoup as soup
udata = urllib2.urlopen('https://www.sportpesa.co.ke/?sportId=1')
htmlsource = udata.read()
ssoup = soup(htmlsource,'html.parser')
page_div = ssoup.findAll("div",{"class":"match football FOOTBALL - HIGHLIGHTS"})
print page_div
"match football FOOTBALL - HIGHLIGHTS" is a dynamic class, so you are just getting an empty list.
Here is my code in python3
from bs4 import BeautifulSoup as bs4
import requests
request = requests.get('https://www.sportpesa.co.ke/?sportId=1')
soup = bs4(request.text, 'lxml')
print(soup)
After printing soup you will find that this class is not present in the page source. I hope this helps.
So, as suggested in the comment, the best (fastest) way to get data from this site is to use the same endpoint that the JavaScript uses.
If you use Chrome, open the Inspector tool, go to the Network tab and load the page. You'll see that the site gets the data from a URL that looks very similar to the one displayed in the address bar, namely
https://sportpesa.co.ke/sportgames?sportId=1
This endpoint gives you the data you need. Grabbing it with requests and getting the divs you want can be done like this:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://sportpesa.co.ke/sportgames?sportId=1")
soup = BeautifulSoup(r.text, "html.parser")
# A bare "-" is not valid as a CSS class selector in current Beautiful Soup releases,
# so ".-" is left out here; the other four classes are still required.
page_divs = soup.select('div.match.football.FOOTBALL.HIGHLIGHTS')
print(len(page_divs))
That prints 30, which is the number of divs. By the way, I'm using the bs4 method select here, which is the bs4-recommended way of doing things when, as here, the element has not just one but multiple classes ('match', 'football', 'FOOTBALL', '-' and 'HIGHLIGHTS').
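When a class token such as the bare "-" is awkward to express in a CSS selector, another option is to pass a function to find_all and compare class sets directly. The sketch below runs on a trimmed, made-up copy of the markup from the question; the helper name has_all_classes is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the question's HTML, plus one non-matching div.
html = """
<div id="g1390856" class="match football FOOTBALL - HIGHLIGHTS">
  <ul class="teams"><li>Hammarby</li><li>Ostersunds</li></ul>
</div>
<div class="match basketball">
  <ul class="teams"><li>Other</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

wanted = {"match", "football", "FOOTBALL", "-", "HIGHLIGHTS"}


def has_all_classes(tag):
    # For parsed HTML, tag.get("class", []) is a list of class tokens.
    return tag.name == "div" and wanted.issubset(tag.get("class", []))


matches = soup.find_all(has_all_classes)
teams = [[li.get_text(strip=True) for li in m.select("ul.teams li")] for m in matches]
print(teams)
```

The set comparison ignores the order of the classes and tolerates extra classes on the element.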

Python Beautiful soup to scrape urls from a web page

I am trying to scrape URLs from an HTML website using Beautiful Soup. Here's part of the HTML.
<li style="display: block;">
<article itemscope itemtype="http://schema.org/Article">
<div class="col-md-3 col-sm-3 col-xs-12" >
<a href="/stroke?p=3083" class="article-image">
<img itemprop="image" src="/FileUploads/Post/3083.jpg?w=300&h=160&mode=crop" alt="Banana" title="Good for health">
</a>
</div>
<div class="col-md-9 col-sm-9 col-xs-12">
<div class="article-content">
<a href="/stroke">
<img src="/assets/home/v2016/img/icon/stroke.png" style="float:left;margin-right:5px;width: 4%;">
</a>
<a href="/stroke?p=3083" class="article-title">
<div>
<h4 itemprop="name" id="playground">
Banana Good for health </h4>
</div>
</a>
<div>
<div class="clear"></div>
<span itemprop="dateCreated" style="font-size:10pt;color:#777;">
<i class="fa fa-clock-o" aria-hidden="true"></i>
09/10 </span>
</div>
<p itemprop="description" class="hidden-phone">
<a href="/stroke?p=3083">
I love Banana.
</a>
</p>
</div>
</div>
</article>
</li>
My code:
import requests
from bs4 import BeautifulSoup
re=requests.get('http://xxxxxx')
bs=BeautifulSoup(re.text.encode('utf-8'), "html.parser")
for link in bs.find_all('a') :
    if link.has_attr('href'):
        print (link.attrs['href'])
This prints all the URLs on the page, but that is not what I am looking for. I only want particular ones like "/stroke?p=3083" in this example. How can I set that condition in Python? (I know "/stroke?p=3083" appears three times here, but I just need it once.)
Another question. This URL is not complete; I need to combine it with "http://www.abcde.com" so the result will be "http://www.abcde.com/stroke?p=3083". I know I can use paste in R, but how do I do this in Python? Thanks in advance! :)
Just replace some_link in the scraper with your link and give it a go. I suppose you will get your desired link in its full form.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
res = requests.get(some_link).text
soup = BeautifulSoup(res,"lxml")
for item in soup.select(".article-image"):
    print(urljoin(some_link,item['href']))
Another question. This url is not complete, I need to combine them
with "http://www.abcde.com" so the result will be
"http://www.abcde.com/stroke?p=3083". I know I can use paste in R, but
how to do this in Python? Thanks in advance! :)
link = 'http://abcde.com' + link
You are getting most of it right already. Collect the href values as follows (just a list comprehension version of what you are doing already):
urls = [a['href'] for a in bs.find_all('a') if a.has_attr('href')]
This will give you the URLs. To take one of them and append it to the abcde URL, you could simply do the following:
if urls:
    new_url = 'http://www.abcde.com{}'.format(urls[0])
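Putting the two parts of the question together, here is a minimal offline sketch: it keeps only the article links, drops the repeats, and builds absolute URLs with urljoin. The base URL is the placeholder from the question, the HTML is a trimmed sample, and the "?p=" test is an assumed way to recognise article links.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://www.abcde.com"  # placeholder site from the question
html = """
<a href="/stroke?p=3083" class="article-image"></a>
<a href="/stroke"></a>
<a href="/stroke?p=3083" class="article-title"></a>
<p><a href="/stroke?p=3083">I love Banana.</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

seen = set()
article_urls = []
for a in soup.find_all("a", href=True):
    href = a["href"]
    # Keep only article links (assumed to carry a ?p= query) and skip repeats.
    if "?p=" in href and href not in seen:
        seen.add(href)
        article_urls.append(urljoin(base_url, href))

print(article_urls)
```

urljoin also copes with hrefs that are already absolute, which plain string concatenation would break.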
