BeautifulSoup how to use for loops and extract specific data? - python

The HTML code below is from a website regarding movie reviews. I want to extract the Stars from the code below, which would be John C. Reilly, Sarah Silverman and Gal Gadot. How could I do this?
Code:
html_doc = """
<html>
<head>
</head>
<body>
<div class="credit_summary_item">
<h4 class="inline">Stars:</h4>
John C. Reilly,
Sarah Silverman,
Gal Gadot
<span class="ghost">|</span>
See full cast & crew »
</div>
</body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
My idea
I was going to use for loops to iterate through each div class until I found the class with the text Stars, from which I could then extract the names. But I don't know how I would code this, as I am not too familiar with HTML syntax or the module.

You can iterate over all a tags in the credit_summary_item div:
from bs4 import BeautifulSoup as soup
*results, _ = [i.text for i in soup(html_doc, 'html.parser').find('div', {'class':'credit_summary_item'}).find_all('a')]
Output:
['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']
Edit:
_d = [i for i in soup(html_doc, 'html.parser').find_all('div', {'class':'credit_summary_item'}) if 'Stars:' in i.text][0]
*results, _ = [i.text for i in _d.find_all('a')]
Output:
['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']

I will show how to implement this; you will see that you only need to learn BeautifulSoup syntax.
First, use the findAll method to get every "div" tag whose "class" attribute is "credit_summary_item":
divs = soup.findAll("div", attrs={"class": "credit_summary_item"})
Then, filter out all the divs that do not have "Stars:" in their heading:
stars = [div for div in divs if "Stars:" in div.h4.text]
If there is only one div with "Stars:", you can take it out of the list:
star = stars[0]
Then find the text of every "a" tag inside it:
names = [a.text for a in star.findAll("a")]
You can see that I didn't use any HTML/CSS syntax, only the soup API.
I hope it helped.
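The for-loop idea from the question can be sketched end to end as below. This is a minimal sketch under assumptions: the HTML shown in the question has no visible `<a>` tags, so the markup here wraps each name in an `<a>` tag (as the answers' find_all('a') calls imply), and the href values and the "Directors:" block are made-up placeholders.

```python
from bs4 import BeautifulSoup

# Assumed markup: names wrapped in <a> tags; all href values are hypothetical.
html_doc = """
<div class="credit_summary_item">
<h4 class="inline">Directors:</h4>
<a href="/name/nm0000001/">Example Director</a>
</div>
<div class="credit_summary_item">
<h4 class="inline">Stars:</h4>
<a href="/name/nm0000002/">John C. Reilly</a>,
<a href="/name/nm0000003/">Sarah Silverman</a>,
<a href="/name/nm0000004/">Gal Gadot</a>
<span class="ghost">|</span>
<a href="/title/tt0000000/fullcredits/">See full cast &amp; crew</a> »
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

names = []
for div in soup.find_all("div", class_="credit_summary_item"):
    heading = div.find("h4")
    # Skip every block whose heading is not "Stars:"
    if heading is None or "Stars:" not in heading.get_text():
        continue
    for a in div.find_all("a"):
        # Links after the "|" separator span (e.g. "See full cast & crew") are not names
        if a.find_previous_sibling("span", class_="ghost") is None:
            names.append(a.get_text(strip=True))
    break  # only one "Stars:" block is expected

print(names)  # ['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']
```

Checking for the separator span does the same job as the `*results, _ = ...` unpacking in the first answer, which simply drops the last link.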

You can also use a regular expression to match the href of each name link:
import re
stars = soup.findAll('a', href=re.compile('/name/nm.+'))
names = [x.text for x in stars]
names
# output: ['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']
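A self-contained version of the regex approach might look like this. It is only a sketch: the hrefs of the form /name/nm<digits>/ are an assumption based on the pattern above (they are not visible in the question's HTML), and the concrete values are placeholders.

```python
import re
from bs4 import BeautifulSoup

# Assumed markup; the href values are hypothetical placeholders.
html_doc = """
<div class="credit_summary_item">
<h4 class="inline">Stars:</h4>
<a href="/name/nm0000002/">John C. Reilly</a>,
<a href="/name/nm0000003/">Sarah Silverman</a>,
<a href="/name/nm0000004/">Gal Gadot</a>
<span class="ghost">|</span>
<a href="/title/tt0000000/fullcredits/">See full cast &amp; crew</a>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
stars_div = soup.find("div", class_="credit_summary_item")

# Anchoring the pattern with ^ and \d+ keeps the "full cast" link out
name_links = stars_div.find_all("a", href=re.compile(r"^/name/nm\d+"))
names = [a.get_text(strip=True) for a in name_links]
print(names)  # ['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']
```

Searching inside the div rather than the whole soup avoids picking up name links from other credit blocks on the page.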

Related

Access 'aria-label' of Yelp review page using BeautifulSoup

I want to access the text ("5 star rating") in the 'aria label' with BeautifulSoup
<div class="lemon--div__373c0__1mboc i-stars__373c0__1T6rz i-stars--regular-5__373c0__N5JxY border-color--default__373c0__3-ifU overflow--hidden__373c0__2y4YK" aria-label="5 star rating" role="img"><img class="lemon--img__373c0__3GQUb offscreen__373c0__1KofL" src="https://s3-media0.fl.yelpcdn.com/assets/public/stars_v2.yji-52d3d7a328db670d4402843cbddeed89.png" width="132" height="560" alt=""></div>
When I use soup.find_all('div',attrs={'class':' aria-label'}), it returns an empty list.
Can someone please help me with it?
Here aria-label is not a class; it is an attribute of the div tag, so you need to access it as an attribute.
from bs4 import BeautifulSoup
s = """<div class="lemon--div__373c0__1mboc i-stars__373c0__1T6rz i-stars--regular-5__373c0__N5JxY border-color--default__373c0__3-ifU overflow--hidden__373c0__2y4YK" aria-label="5 star rating" role="img"><img class="lemon--img__373c0__3GQUb offscreen__373c0__1KofL" src="https://s3-media0.fl.yelpcdn.com/assets/public/stars_v2.yji-52d3d7a328db670d4402843cbddeed89.png" width="132" height="560" alt=""></div>
"""
soup = BeautifulSoup(s, "html.parser")
print(soup.div["aria-label"])
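If the page has several rating divs, one way to collect them all is to search by the presence of the attribute instead of by class. A minimal sketch, using shortened, made-up class names rather than Yelp's real ones:

```python
from bs4 import BeautifulSoup

# Simplified stand-in markup; the class names are placeholders.
html = """
<div class="i-stars" aria-label="5 star rating" role="img"></div>
<div class="i-stars" aria-label="3 star rating" role="img"></div>
<div class="other"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# attrs={"aria-label": True} matches any div that has the attribute at all
rating_divs = soup.find_all("div", attrs={"aria-label": True})
ratings = [div["aria-label"] for div in rating_divs]
print(ratings)  # ['5 star rating', '3 star rating']
```

The attrs dict is needed here because aria-label contains a hyphen and so cannot be passed as a Python keyword argument.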

BeautifulSoup: finding nested tag

I am quite stuck with this:
<span>Alpha<span class="class_xyz">Beta</span></span>
I am trying to scrape only the first span text "Alpha" (excluding the second nested "Beta").
How would you do that?
I am trying to write a function to find all the Span tags without a class attribute, but something is not working...
Thanks.
One way to handle it:
from bs4 import BeautifulSoup as bs
txt = """<doc>
<span>Alpha<span class="class_xyz">Beta</span></span>
</doc>"""
soup = bs(txt,'lxml')
target = soup.select_one('span[class]')
target.decompose()
soup.text.strip()
Output:
'Alpha'
Here is another way that gets the text for every span tag without a class attribute:
from bs4 import BeautifulSoup
html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""
soup = BeautifulSoup(html, "html.parser")
# Remove every span that has a class attribute (the nested ones)
for tag in soup.select("span[class]"):
    tag.decompose()
# Collect the text of the remaining spans, which have no class
out = [span.text.strip() for span in soup.select("span")]
print(out)
Output:
['Alpha', 'Gamma', 'Epsilon']
Or if you want the whole span tag:
from bs4 import BeautifulSoup
html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
<span>Epsilon<span class="class_lmn">Zeta</span></span>
</body>
"""
soup = BeautifulSoup(html, "html.parser")
# Remove every span that has a class attribute (the nested ones)
for tag in soup.select("span[class]"):
    tag.decompose()
out = soup.select("span")
print(out)
Output:
[<span>Alpha</span>, <span>Gamma</span>, <span>Epsilon</span>]
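Both answers above modify the tree with decompose(). If you would rather leave the soup untouched, a possible alternative (a different technique, not the one used above) is to pick the spans without a class using a filter function and read only their direct text nodes:

```python
from bs4 import BeautifulSoup

html = """
<body>
<p>Some random text</p>
<span>Alpha<span class="class_xyz">Beta</span></span>
<span>Gamma<span class="class_abc">Delta</span></span>
</body>
"""

soup = BeautifulSoup(html, "html.parser")


def is_classless_span(tag):
    """Match <span> tags that carry no class attribute."""
    return tag.name == "span" and not tag.has_attr("class")


out = []
for span in soup.find_all(is_classless_span):
    # recursive=False limits the search to direct children, so nested text is skipped
    own_text = "".join(span.find_all(string=True, recursive=False)).strip()
    out.append(own_text)

print(out)  # ['Alpha', 'Gamma']
```

Because nothing is removed, the nested spans ("Beta", "Delta") are still available afterwards if you need them.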

BeautifulSoup - Given ID how to extract other categories in div?

The code below extracts the div that contains the number of votes on a website:
votes = soup.find('div', {'id': 'vote_text_299159'})
<div id="vote_text_247231" class="left action_unclicked_show cursor" style="width:97px;margin-left:3px;" data-counttext="191 votes" data-actiontext="VOTE!" onmouseover="user_action_button_mouseover('vote','247231')" onmouseout="user_action_button_mouseout('vote', '247231')" onclick="loginPrompt('vote');clickTrack("photo_vote");">
This returns the div shown below the code.
How can I extract data-counttext (or other categories) given this return and just using the id?
I believe the get('attribute_name') function works here. In your case 'attribute_name' would be 'data-counttext'. Code below.
votes = soup.find('div', {'id': 'vote_text_299159'})
dataCountText = votes.get('data-counttext')
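Note that find() returns None when no element has the requested id (the id in the question's code differs from the one in the div shown), and calling .get() on None raises an AttributeError. A small sketch that guards against this; the helper name get_attr_by_id is made up for illustration, and the div is trimmed to the relevant attributes:

```python
from bs4 import BeautifulSoup

html = '<div id="vote_text_247231" data-counttext="191 votes" data-actiontext="VOTE!"></div>'
soup = BeautifulSoup(html, "html.parser")


def get_attr_by_id(soup, element_id, attr_name):
    """Hypothetical helper: return the attribute value, or None if the div or attribute is missing."""
    tag = soup.find("div", id=element_id)
    if tag is None:
        return None
    # Tag.get() returns None instead of raising when the attribute is absent
    return tag.get(attr_name)


print(get_attr_by_id(soup, "vote_text_247231", "data-counttext"))   # 191 votes
print(get_attr_by_id(soup, "vote_text_247231", "data-actiontext"))  # VOTE!
print(get_attr_by_id(soup, "vote_text_299159", "data-counttext"))   # None
```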
Here is a regular expression approach:
import re
pattern=r'data-counttext="(\w.+?)"'
string_1="""<div id="vote_text_247231" class="left action_unclicked_show cursor" style="width:97px;margin-left:3px;" data-counttext="191 votes" data-actiontext="VOTE!" onmouseover="user_action_button_mouseover('vote','247231')" onmouseout="user_action_button_mouseout('vote', '247231')" onclick="loginPrompt('vote');clickTrack("photo_vote");">"""
match=re.search(pattern,string_1)
print(match.group(1))
output:
191 votes
You can use tag.attrs to get all the attributes of a tag.
from bs4 import BeautifulSoup as bs
s = """<div id="vote_text_247231" class="left action_unclicked_show cursor" style="width:97px;margin-left:3px;" data-counttext="191 votes" data-actiontext="VOTE!" onmouseover="user_action_button_mouseover('vote','247231')" onmouseout="user_action_button_mouseout('vote', '247231')" onclick="loginPrompt('vote');clickTrack("photo_vote");">"""
soup = bs(s, "lxml")
votes = soup.find("div", {"id": "vote_text_247231"})
votes.attrs
output
{'class': ['left', 'action_unclicked_show', 'cursor'],
'data-actiontext': 'VOTE!',
'data-counttext': '191 votes',
'id': 'vote_text_247231',
'onclick': 'loginPrompt(\'vote\');clickTrack("photo_vote");',
'onmouseout': "user_action_button_mouseout('vote', '247231')",
'onmouseover': "user_action_button_mouseover('vote','247231')",
'style': 'width:97px;margin-left:3px;'}

Python: How to extract URL from HTML Page using BeautifulSoup?

I have a HTML Page with multiple divs like
<div class="article-additional-info">
A peculiar situation arose in the Supreme Court on Tuesday when two lawyers claimed to be the representative of one of the six accused in the December 16 gangrape case who has sought shifting of t...
<a class="more" href="http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece">
<span class="arrows">»</span>
</a>
</div>
<div class="article-additional-info">
Power consumers in the city will have to brace for spending more on their monthly bills as all three power distribution companies – the Anil Ambani-owned BRPL and BYPL and the Tatas-owned Tata Powe...
<a class="more" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece">
<a class="commentsCount" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments">
</div>
and I need to get the <a href=> value for all the divs with class article-additional-info
I am new to BeautifulSoup
so I need the the urls
"http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece"
"http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece"
What is the best way to achieve this?
According to your criteria, it returns three URLs (not two) - did you want to filter out the third?
The basic idea is to iterate through the HTML, pulling out only the elements with your class, and then to iterate through all of the links in each element, pulling out the actual URLs:
In [1]: from bs4 import BeautifulSoup
In [2]: html = "..."  # your HTML
In [3]: soup = BeautifulSoup(html, 'html.parser')
In [4]: for item in soup.find_all(attrs={'class': 'article-additional-info'}):
   ...:     for link in item.find_all('a'):
   ...:         print(link.get('href'))
   ...:
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments
This limits your search to just those elements with the article-additional-info class tag, and inside of there looks for all anchor (a) tags and grabs their corresponding href link.
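If the third URL (the comments link) is not wanted, one option is to select only the anchors with class "more" inside those divs. A minimal sketch, with well-formed stand-in markup and example.com placeholder URLs instead of the real ones:

```python
from bs4 import BeautifulSoup

# Stand-in markup; the URLs are placeholders.
html = """
<div class="article-additional-info">
  First summary...
  <a class="more" href="http://example.com/article-1.ece">more</a>
</div>
<div class="article-additional-info">
  Second summary...
  <a class="more" href="http://example.com/article-2.ece">more</a>
  <a class="commentsCount" href="http://example.com/article-2.ece#comments">comments</a>
</div>
<div class="unrelated"><a class="more" href="http://example.com/other">other</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: <a class="more"> with an href, nested in the target divs
urls = [a["href"] for a in soup.select("div.article-additional-info a.more[href]")]
print(urls)  # ['http://example.com/article-1.ece', 'http://example.com/article-2.ece']
```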
After working with the documentation, I did it the following way. Thank you all for your answers; I appreciate them.
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> f = urlopen('http://www.thehindu.com/news/cities/delhi/?union=citynews')
>>> soup = BeautifulSoup(f, 'html.parser')
>>> for link in soup.select('.article-additional-info'):
...     print(link.find('a').attrs['href'])
...
http://www.thehindu.com/news/cities/Delhi/airport-metro-express-is-back/article4335059.ece
http://www.thehindu.com/news/cities/Delhi/91-more-illegal-colonies-to-be-regularised/article4335069.ece
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/nurses-women-groups-demand-safety-audit-of-workplaces/article4331470.ece
http://www.thehindu.com/news/cities/Delhi/delhi-bpl-families-to-get-12-subsidised-lpg-cylinders/article4328990.ece
http://www.thehindu.com/news/cities/Delhi/shias-condemn-violence-against-religious-minorities/article4328276.ece
http://www.thehindu.com/news/cities/Delhi/new-archbishop-of-delhi-takes-over/article4328284.ece
http://www.thehindu.com/news/cities/Delhi/delhi-metro-to-construct-subway-without-disrupting-traffic/article4328290.ece
http://www.thehindu.com/life-and-style/Food/going-for-the-kill-in-patparganj/article4323210.ece
http://www.thehindu.com/news/cities/Delhi/fire-at-janpath-bhavan/article4335068.ece
http://www.thehindu.com/news/cities/Delhi/fiveyearold-girl-killed-as-school-van-overturns/article4335065.ece
http://www.thehindu.com/news/cities/Delhi/real-life-stories-of-real-women/article4331483.ece
http://www.thehindu.com/news/cities/Delhi/women-councillors-allege-harassment-by-male-councillors-of-rival-parties/article4331471.ece
http://www.thehindu.com/news/cities/Delhi/airport-metro-resumes-today/article4331467.ece
http://www.thehindu.com/news/national/hearing-today-on-plea-to-shift-trial/article4328415.ece
http://www.thehindu.com/news/cities/Delhi/protestors-demand-change-in-attitude-of-men-towards-women/article4328277.ece
http://www.thehindu.com/news/cities/Delhi/bjp-promises-5-lakh-houses-for-poor-on-interestfree-loans/article4328280.ece
http://www.thehindu.com/life-and-style/metroplus/papad-bidi-and-a-dacoit/article4323219.ece
http://www.thehindu.com/life-and-style/Food/gharana-of-food-not-just-music/article4323212.ece
>>>
from bs4 import BeautifulSoup as BS
html = "..."  # your HTML
soup = BS(html, 'html.parser')
for div in soup.find_all('div', class_='article-additional-info'):
    for link in div.find_all('a'):
        print(link.get('href'))
Which prints:
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments

Python BeautifulSoup parsing

I am trying to scrape some content (am very new to Python) and I have hit a stumbling block. The code I am trying to scrape is:
<h2>Spear & Jackson Predator Universal Hardpoint Saw - 22"</h2>
<p><span class="productlist_mostwanted_rrp">
Was: <span class="strikethrough">£12.52</span></span><span class="productlist_mostwanted_save">Save: £6.57(52%)</span></p>
<div class="clear"></div>
<p class="productlist_mostwanted_price">Now: £5.95</p>
What I am trying to scrape is the link text (Spear & Jackson etc) and the price (£5.95). I have looked about on Google, the BeautifulSoup documentation and on this forum and I managed to get to extract the "Now: £5.95" using this code:
for node in soup.findAll('p', { "class" : "productlist_grid_price" }):
    print(''.join(node.findAll(text=True)))
However the result I am after is just 5.95. I have also had limited success trying to get the link text (Spear & Jackson) using:
soup.h2.a.contents[0]
However of course this returns just the first result.
The ultimate result that I am aiming for is to have the results look like:
Spear & Jackson Predator Universal Hardpoint Saw - 22 5.95
etc
etc
As I am looking to export this to a csv, I need to figure out how to put the data into 2 columns. Like I say I am very new to python so I hope this makes sense.
I appreciate any help!
Many thanks
I think what you're looking for is something like this:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(open('prueba.html').read(), 'html.parser')
# Collapse runs of whitespace in the link text
item = re.sub(r'\s+', ' ', soup.h2.a.text)
price = soup.find('p', {'class': 'productlist_mostwanted_price'}).text
# Capture just the number from "Now: £5.95"
price = re.search(r'\d+\.\d+', price).group(0)
print(item, price)
Example output:
Spear & Jackson Predator Universal Hardpoint Saw - 22" 5.95
Note that for the item, the regular expression is used just to remove extra whitespace, while for the price it is used to capture the number.
html = '''
<h2>Spear & Jackson Predator Universal Hardpoint Saw - 22</h2>
<p><span class="productlist_mostwanted_rrp">
Was: <span class="strikethrough">£12.52</span></span><span class="productlist_mostwanted_save">Save: £6.57(52%)</span></p>
<div class="clear"></div>
<p class="productlist_mostwanted_price">Now: £5.95</p>
'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, 'html.parser')
desc = soup.h2.a.getText()
price_str = soup.find('p', {"class": "productlist_mostwanted_price" }).getText()
# Pull the numeric part out of "Now: £5.95" and convert it
price = float(re.search(r'[0-9.]+', price_str).group())
print(desc, price)
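To handle every product on the page and get the two columns for a CSV, the same idea can be put in a loop. This is a rough sketch under assumptions: each product name is a link inside an `<h2>` (the question calls it "link text", although the `<a>` tag is not visible in the HTML shown), each price paragraph follows its own heading, and the hrefs and the second product are invented for illustration.

```python
import csv
import io
import re

from bs4 import BeautifulSoup

# Assumed markup; hrefs and the second product are hypothetical.
html = """
<h2><a href="/product/1">Spear &amp; Jackson Predator Universal Hardpoint Saw - 22"</a></h2>
<p class="productlist_mostwanted_price">Now: £5.95</p>
<h2><a href="/product/2">Example Second Product</a></h2>
<p class="productlist_mostwanted_price">Now: £10.00</p>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for price_tag in soup.find_all("p", class_="productlist_mostwanted_price"):
    heading = price_tag.find_previous("h2")  # nearest <h2> before this price
    match = re.search(r"\d+(?:\.\d+)?", price_tag.get_text())
    if heading is None or match is None:
        continue  # skip incomplete entries
    name = " ".join(heading.get_text().split())  # collapse extra whitespace
    rows.append((name, match.group(0)))

# Write to an in-memory buffer; a real file opened with newline="" works the same way
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["item", "price"])
writer.writerows(rows)
print(buffer.getvalue())
```

The csv module takes care of quoting, which matters here because the first product name contains a double quote.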
