Further Probing in BeautifulSoup - python

I'm pretty new to BeautifulSoup and web scraping in general. I am currently running this code:
attraction_names_full = soup.find_all('td', class_='alt2', align = 'right', height = '28')
This returns a list of objects that look like this:
<td align="right" class="alt2" height="28">
A Pirate's Adventure - Treasures of the Seven Seas
<br/>
<span style="font-size: 9px; color: #627DAD; font-style: italic;">
12:00pm to 6:00pm
</span>
</td>
What I want from this is just the line containing the attraction name, which in this case would be:
A Pirate's Adventure - Treasures of the Seven Seas
However, I'm not sure how to go about this, as there don't seem to be any tags surrounding just the text.
I have tried to work with the elements as strings, but the object type seems to be:
<class 'bs4.element.Tag'>
I'm not sure how to manipulate this, and I'm sure there must be a more efficient way of achieving it.
Any ideas on how to do this? For reference, the web page I'm looking at is:
url = 'https://www.thedibb.co.uk/forums/wait_times.php?a=MK'

You could extract the <span> element and then get the stripped text as follows:
from bs4 import BeautifulSoup
import requests
html = requests.get('https://www.thedibb.co.uk/forums/wait_times.php?a=MK').content
soup = BeautifulSoup(html, "html.parser")
for attraction in soup.find_all('td', class_='alt2', align='right', height='28'):
    # Remove the <span> holding the opening hours so only the name remains.
    if attraction.span:
        attraction.span.extract()
    print(attraction.get_text(strip=True))
Which would give you output starting:
A Pirate's Adventure - Treasures of the Seven Seas
Captain Jack Sparrow's Pirate Tutorial
Jungle Cruise
Meet Characters from Aladdin in Adventureland
Pirates of the Caribbean
Swiss Family Treehouse

import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen("https://www.thedibb.co.uk/forums/wait_times.php?a=MK").read()
soup = BeautifulSoup(html, 'html.parser')
listElem = list(soup.find_all('td', class_='alt2', align = 'right', height = '28'))
# contents[0] is the text node that precedes the <br/> tag
print(listElem[1].contents[0])
You can also use .contents. It works for me: the output is "Captain Jack Sparrow's Pirate Tutorial".
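Both approaches can be tried offline on the single cell shown in the question. The following is only a minimal sketch: it assumes the live page still uses the markup quoted above, and the names sample_html, name_from_contents and name_from_strings are illustrative, not part of either answer.

```python
from bs4 import BeautifulSoup

# Markup copied from the question; the real page contains many such cells.
sample_html = """
<td align="right" class="alt2" height="28">
A Pirate's Adventure - Treasures of the Seven Seas
<br/>
<span style="font-size: 9px; color: #627DAD; font-style: italic;">
12:00pm to 6:00pm
</span>
</td>
"""

soup = BeautifulSoup(sample_html, "html.parser")
cell = soup.find("td", class_="alt2")

# .contents[0] is the raw text node before <br/>, including its newlines,
# so it needs an explicit strip().
name_from_contents = cell.contents[0].strip()

# .stripped_strings yields each text fragment with whitespace removed;
# the first fragment is the attraction name, the second the opening hours.
name_from_strings = next(cell.stripped_strings)

print(name_from_contents)
print(name_from_strings)
```

Unlike span.extract(), neither of these modifies the parsed tree, which may matter if the opening hours are needed later.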

Related

extract <label><span> tag on html with python

I would like to extract data from a web page such as
https://www.glassdoor.com/Overview/Working-at-Apple-EI_IE1138.11,16.htm
and return the result in the following format:
Website        Headquarters   Size              Revenue                      Type
www.apple.com  Cupertino, CA  10000+ employees  $10+ billion (USD) per year  Company - Public (AAPL)
I then used the following code with BeautifulSoup to get this:
all_href = com_soup.find_all('span', {'class': re.compile('value')})
all_href = list(set(all_href))
It returns the whole <span> tags rather than just their text, and it doesn't include the text of the <label> tags:
[<span class="value"> Computer Hardware & Software</span>,
<span class="value"> Company - Public (AAPL) </span>,
<span class="value">10000+ employees</span>,
<span class="value"> $10+ billion (USD) per year</span>,
<span class="value-title" title="4.0"></span>,
<span class="value">Cupertino, CA</span>,
<span class="value"> 1976</span>,
<span class="value-title" title="5.0"></span>,
<span class="value website"><a class="link" href="http://www.apple.com" rel="nofollow noreferrer" target="_blank">www.apple.com</a></span>]
Your BeautifulSoup query is too specific: it only catches the span tags whose class matches "value", so the labels are never picked up.
When you look at the HTML, you can find the relevant section quickly by searching for the text of some of the fields. What you should do is get everything inside the div tags with class 'infoEntity', which together contain all 7 fields of the "Overview" section.
Within each of those divs there is a label tag holding the field name (the column headings you want above) and a span tag holding its value.
So, start with:
from bs4 import BeautifulSoup
data = """
<div class="eep-pill"><p class="tightVert h2 white"><strong>Enhanced</strong> Profile <span class="round ib"><i class="icon-star-white"></i></span></p></div></header><section class="center flex-grid padVertLg eepModal"><h2>Try Enhanced Profile Free for a Month</h2><p>Explore the many benefits of having a premium branded profile on Glassdoor, like increased influence and advanced analytics.</p><div class="margBot"><i class="feaIllustration"></i></div>www.apple.com</span></div><div class='infoEntity'><label>Headquarters</label><span class='value'>Cupertino, CA</span></div><div class='infoEntity'><label>Size</label><span class='value'>10000+ employees</span></div><div class='infoEntity'><label>Founded</label><span class='value'> 1976</span></div><div class='infoEntity'><label>Type</label><span class='value'> Company - Public (AAPL) </span></div><div class='infoEntity'><label>Industry</label><span class='value'> Computer Hardware & Software</span></div><div class='infoEntity'><label>Revenue</label><span class='value'> $10+ billion (USD) per year</span></div></div></div><div class=''><div data-full="We&rsquo;re a diverse collection of thinkers and doers, continually reimagining what&rsquo;s possible to help us all do what we love in new ways. The people who work here have reinvented entire industries with the Mac, iPhone, iPad, and Apple Watch, as well as with services, including iTunes, the App Store, Apple Music, and Apple Pay. And the same passion for innovation that goes into our products also applies to our practices &mdash; strengthening our commitment to leave the world better than we found it." class='margTop empDescription'> We’re a diverse collection of thinkers and doers, continually reimagining what’s possible to help us all do what we love in new ways. The people who work here have reinvented entire industries with the Mac, iPhone, iPad, and Apple Watch, as well as with ... 
<span class='link minor moreLink' id='ExpandDesc'>Read more</span></div><div class='hr'><hr/></div><h3 class='margTop'>Glassdoor Awards</h3>
"""
items = []
soup = BeautifulSoup(data, 'lxml')
# Each infoEntity div holds one <label>/<span> pair.
for item in soup.find_all("div", {"class": "infoEntity"}):
    label = item.find("label")
    value = item.find("span")
    items.append((label.string, value.string))
With that, items is a list of tuples that prints as:
[('Website', 'www.apple.com'), ('Headquarters', 'Cupertino, CA'), ('Size', '10000+ employees'), ('Founded', ' 1976'), ('Type', ' Company - Public (AAPL) '), ('Industry', ' Computer Hardware & Software'), ('Revenue', ' $10+ billion (USD) per year')]
From there, you can print out that list in any format you like.
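As one possible way to get from that list to the layout requested in the question, the sketch below indexes the values by label and prints a header row and a value row. The tab separator and the names wanted, info, header and row are assumptions made for illustration.

```python
# The (label, value) tuples copied from the output shown above.
items = [
    ('Website', 'www.apple.com'),
    ('Headquarters', 'Cupertino, CA'),
    ('Size', '10000+ employees'),
    ('Founded', ' 1976'),
    ('Type', ' Company - Public (AAPL) '),
    ('Industry', ' Computer Hardware & Software'),
    ('Revenue', ' $10+ billion (USD) per year'),
]

# Columns requested in the question, in the requested order.
wanted = ['Website', 'Headquarters', 'Size', 'Revenue', 'Type']

# Index the values by label and strip the stray whitespace.
info = {label: value.strip() for label, value in items}

header = '\t'.join(wanted)
row = '\t'.join(info[label] for label in wanted)
print(header)
print(row)
```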
As I can see in https://www.glassdoor.com/Overview/Working-at-Apple-EI_IE1138.11,16.htm
you should search for <div class="infoEntity"> instead of <span class="value"> to get what you want. Note that find_all returns a list-like ResultSet, so the second find_all has to be called on each div rather than on the result itself:
all_href = [tag for div in com_soup.find_all('div', {'class': re.compile('infoEntity')}) for tag in div.find_all(['span', 'label'])]
all_href = list(set(all_href))
This returns all the <span> and <label> tags you want.
If you want each <span> and <label> to stay together, change it to:
all_href = [x.decode_contents(formatter="html") for x in com_soup.find_all('div', {'class': re.compile('infoEntity')})]
# or
all_href = [[x.find('span'), x.find('label')] for x in com_soup.find_all('div', {'class': re.compile('infoEntity')})]
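If the goal is the text rather than the tags, the pairs can also be collected into a dictionary. This is a rough sketch on simplified markup modelled on the infoEntity blocks, not on the live page; the html string and the pairs name are made up for the example.

```python
from bs4 import BeautifulSoup

# Simplified markup modelled on the "infoEntity" blocks; not the live page.
html = """
<div class="infoEntity"><label>Headquarters</label><span class="value">Cupertino, CA</span></div>
<div class="infoEntity"><label>Size</label><span class="value">10000+ employees</span></div>
"""

com_soup = BeautifulSoup(html, "html.parser")

pairs = {}
for div in com_soup.find_all("div", class_="infoEntity"):
    label = div.find("label")
    value = div.find("span", class_="value")
    # Skip blocks that lack either part instead of raising AttributeError.
    if label is not None and value is not None:
        pairs[label.get_text(strip=True)] = value.get_text(strip=True)

print(pairs)
```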

BeautifulSoup - return header correspondent to matched footer

I'm using Beautifulsoup to retrieve an artist name from a blog, given a specific match of music tags:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://musicblog.kms-saulgau.de/tag/chillout/')
html = r.content
soup = BeautifulSoup(html, 'html.parser')
Artist names are stored here:
header = soup.find_all('header', class_= "entry-header")
and artist tags here:
span = soup.find_all('span', class_= "tags-links")
I can get all headers:
for each in header:
    if each.find("a"):
        each = each.find("a").get_text()
        print(each)
And then I'm looking for 'alternative' and 'chillout' in the same footer:
for each in span:
    if each.find("a"):
        tags = each.find("a")["href"]
        if "alternative" in tags:
            print(each.get_text())
So far, the code prints:
Terra Nine – The Heart of the Matter
Emmit Fenn – Blinded
Amparo – The Orchid Glacier
Alpha Minus – Satellites
Carbonates on Mars – The Song of Sol
Josey Marina – Ocean Sighs
Sunday – Only
Some Kind Of Illness – The Light
Vesna Kazensky – Raven
James Lowe – Shallow
Tags Alternative, Chillout, Indie Rock, New tracks
But what I'm trying to do is return only the entry corresponding to the matched footer, like so:
Some Kind Of Illness – The Light
Alternative, Chillout, Indie Rock, New tracks
How can I achieve that?
for article in soup.find_all('article'):
    # Keep only articles that link to both the "alternative" and "chillout" tags.
    if article.select('a[href*="alternative"]') and article.select('a[href*="chillout"]'):
        print(article.h2.text)
        print(article.find(class_='tags-links').text)
Output:
Some Kind Of Illness – The Light
Tags Alternative, Chillout, Indie Rock, New tracks
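The idea behind this answer is to loop over each article so the header and its tags stay paired, and to use the CSS attribute selector a[href*="..."], which matches links whose href contains the given substring. The sketch below reproduces that on a small hypothetical page; the markup, the first post's "Pop" tag and the name matches are assumptions, and the real blog's structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two posts, only the second is tagged with both
# "alternative" and "chillout".
html = """
<article>
  <header class="entry-header"><h2><a href="/post-1/">Sunday – Only</a></h2></header>
  <footer><span class="tags-links">Tags <a href="/tag/chillout/">Chillout</a>, <a href="/tag/pop/">Pop</a></span></footer>
</article>
<article>
  <header class="entry-header"><h2><a href="/post-2/">Some Kind Of Illness – The Light</a></h2></header>
  <footer><span class="tags-links">Tags <a href="/tag/alternative/">Alternative</a>, <a href="/tag/chillout/">Chillout</a></span></footer>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

matches = []
for article in soup.find_all("article"):
    tags = article.find(class_="tags-links")
    if tags is None:
        continue
    # Restrict the search to the tag links so a post URL cannot match by accident.
    if tags.select('a[href*="alternative"]') and tags.select('a[href*="chillout"]'):
        tag_names = ", ".join(a.get_text(strip=True) for a in tags.find_all("a"))
        matches.append((article.h2.get_text(strip=True), tag_names))

for title, tag_names in matches:
    print(title)
    print(tag_names)
```

Joining the link texts rather than printing the whole span also drops the leading "Tags" label, which matches the output the question asked for.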

Scrape only selected text from tables using Python/Beautiful soup/pandas

I am new to Python and am using beautiful soup for web scraping for a project.
I am hoping to only get parts of the text in a list/dictionary. I started with the following code:
url = "http://eng.mizon.co.kr/productlist.asp"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
tables = soup.find_all('table')
This let me parse the data into tables, and one of the items looked like this:
<table border="0" cellpadding="0" cellspacing="0" width="235">
<tr>
<td align="center" height="238"><img alt="LL IN ONE SNAIL REPAIR CREAM, SNAIL REPAIR BLEMISH BALM, WATERMAX MOISTURE B.B CREAM, WATERMAX AQUA GEL CREAM, CORRECT COMBO CREAM, GOLD STARFISH ALL IN ONE CREAM, S-VENOM WRINKLE TOX CREAM, BLACK SNAIL ALL IN ONE CREAM, APPLE SMOOTHIE PEELING GEL, REAL SOYBEAN DEEP CLEANSING OIL, COLLAGEN POWER LIFTING CREAM, SNAIL RECOVERY GEL CREAM" border="0" src="http://www.mizon.co.kr/images/upload/product/20150428113514_3.jpg" width="240"/></td>
</tr>
<tr>
<td align="center" height="43" valign="middle"><a href="javascript:fnMoveDetail(7499)" onfocus="this.blur()"><span class="style3">ENJOY VITAL-UP TIME Lift Up Mask <br/>
Volume:25ml</span></a></td>
</tr>
</table>
For each such item, I would like to extract only the following from the last data cell in the table above:
1) The four-digit number in href="javascript:fnMoveDetail(7499)"
2) The name under class style3
3) The volume under class style3
The next lines in my code were as follows:
df = pd.read_html(str(tables), skiprows={0}, flavor="bs4")[0]
a_links = soup.find_all('a', attrs={'class':'style3'})
stnid_dict = {}
for a_link in a_links:
    cid = ((a_link['href'].split("javascript:fnMoveDetail("))[1].split(")")[0])
    stnid_dict[a_link.text] = cid
My objective is to use the numbers to go to individual links and then match the info scraped on this page to each link.
What would be the best way to approach this?
Use the <a> tag whose href contains the JavaScript call as an anchor: select every span inside such a link, then go back up to its parent <a> to read the href.
import requests
from bs4 import BeautifulSoup
url = "http://eng.mizon.co.kr/productlist.asp"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
spans = soup.select('td > a[href*="javascript:fnMoveDetail"] > span')
for span in spans:
    # strip() removes these characters from both ends, leaving only the digits
    href = span.find_parent('a').get('href').strip('javascript:fnMoveDetail()')
    name, volume = span.get_text(strip=True).split('Volume:')
    print(name, volume, href)
Output:
Dust Clean up Peeling Toner 150ml 8235
Collagen Power Lifting EX Toner 150ml 8067
Collagen Power Lifting EX Emulsion 150ml 8068
Barrier Oil Toner 150ml 8059
Barrier Oil Emulsion 150ml 8060
BLACK CLEAN UP PORE WATER FINISHER 150ml 7650
Vita Lemon Sparkling Toner 150ml 7356
INTENSIVE SKIN BARRIER TONER 150ml 7110
INTENSIVE SKIN BARRIER EMULSION 150ml 7111
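One caveat about the answer above: str.strip with a string argument removes any of those characters from both ends, not a literal prefix and suffix. It works here because the ids consist only of digits. A regular expression makes the intent explicit, and a dictionary keyed by id fits the stated goal of matching this data to the detail pages later. The sketch below runs on the single cell quoted in the question; the names detail_id and products are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Cell markup copied from the question (one product only).
html = """
<table><tr>
<td align="center" height="43" valign="middle"><a href="javascript:fnMoveDetail(7499)" onfocus="this.blur()"><span class="style3">ENJOY VITAL-UP TIME Lift Up Mask <br/>
Volume:25ml</span></a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
detail_id = re.compile(r"fnMoveDetail\((\d+)\)")

products = {}
for link in soup.select('a[href*="fnMoveDetail"]'):
    match = detail_id.search(link["href"])
    span = link.find("span", class_="style3")
    if match is None or span is None:
        continue
    # partition() tolerates a missing "Volume:" marker, unlike unpacking split().
    name, _, volume = span.get_text(" ", strip=True).partition("Volume:")
    products[match.group(1)] = (name.strip(), volume.strip())

print(products)
```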

python finding index of tag in string

HTML
<div class="productDescriptionWrapper">
<p>A worm worth getting your hands dirty over. With over six feet of crawl space, Playhut’s Wiggly Worm is a brightly colored and friendly play structure.
</p>
<ul>
<li>6ft of crawl through fun</li>
<li>18” diameter for easy crawl through</li>
<li>Bright colorful design</li>
<li>Product Measures: 18""Diam x 60""L</li>
<li>Recommended Ages: 3 years & up<br /> </li>
</ul>
<p><strong>Intended for Indoor Use</strong></p>
Code
def GetBullets(self, Soup):
    bulletList = []
    bullets = str(Soup.findAll('div', {'class': 'productDescriptionWrapper'}))
    bullets_re = re.compile('<li>(.*)</li>')
    bullets_pat = str(re.findall(bullets_re, bullets))
    index = bullets_pat.findall('</li>')
    print(index)
How can I extract the p tags and li tags? Thanks!
Notice the following:
>>> from bs4 import BeautifulSoup
>>> html = """ <what you have above> """
>>> Soup = BeautifulSoup(html, 'html.parser')
>>> bullets = Soup.find_all('div', {'class': 'productDescriptionWrapper'})
>>> ptags = bullets[0].find_all('p')
>>> print(ptags)
[<p>A worm worth getting your hands dirty over. With over six feet of crawl space, Playhut’s Wiggly Worm is a brightly colored and friendly play structure.
</p>, <p><strong>Intended for Indoor Use</strong></p>]
>>> print(ptags[0].text)
A worm worth getting your hands dirty over. With over six feet of crawl space, Playhut’s Wiggly Worm is a brightly colored and friendly play structure.
You can get at the contents of your li tags in a similar manner.
We use Beautiful Soup for this.
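For the li tags, the same pattern might look like the sketch below. It uses the bs4 package and an abbreviated copy of the question's markup, so the shortened html string and the names wrapper, paragraphs and bullets are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Abbreviated version of the markup from the question.
html = """
<div class="productDescriptionWrapper">
<p>A worm worth getting your hands dirty over.</p>
<ul>
<li>6ft of crawl through fun</li>
<li>Bright colorful design</li>
</ul>
<p><strong>Intended for Indoor Use</strong></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
wrapper = soup.find("div", class_="productDescriptionWrapper")

# get_text(strip=True) returns the tag's text without surrounding whitespace.
paragraphs = [p.get_text(strip=True) for p in wrapper.find_all("p")]
bullets = [li.get_text(strip=True) for li in wrapper.find_all("li")]

print(paragraphs)
print(bullets)
```

No regular expression or string index is needed, because the parser already knows where each tag starts and ends.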

Python BeautifulSoup parsing

I am trying to scrape some content (am very new to Python) and I have hit a stumbling block. The code I am trying to scrape is:
<h2>Spear & Jackson Predator Universal Hardpoint Saw - 22"</h2>
<p><span class="productlist_mostwanted_rrp">
Was: <span class="strikethrough">£12.52</span></span><span class="productlist_mostwanted_save">Save: £6.57(52%)</span></p>
<div class="clear"></div>
<p class="productlist_mostwanted_price">Now: £5.95</p>
What I am trying to scrape is the link text (Spear & Jackson etc) and the price (£5.95). I have looked about on Google, the BeautifulSoup documentation and on this forum and I managed to get to extract the "Now: £5.95" using this code:
for node in soup.findAll('p', { "class" : "productlist_grid_price" }):
    print(''.join(node.findAll(text=True)))
However the result I am after is just 5.95. I have also had limited success trying to get the link text (Spear & Jackson) using:
soup.h2.a.contents[0]
However of course this returns just the first result.
The ultimate result that I am aiming for is to have the results look like:
Spear & Jackson Predator Universal Hardpoint Saw - 22 5.95
etc
etc
As I am looking to export this to a csv, I need to figure out how to put the data into 2 columns. Like I say I am very new to python so I hope this makes sense.
I appreciate any help!
Many thanks
I think what you're looking for is something like this:
from bs4 import BeautifulSoup
import re
with open('prueba.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
# Collapse runs of whitespace in the link text.
item = re.sub(r'\s+', ' ', soup.h2.a.text)
price = soup.find('p', {'class': 'productlist_mostwanted_price'}).text
# Capture just the number, e.g. "5.95" from "Now: £5.95".
price = re.search(r'\d+\.\d+', price).group(0)
print(item, price)
Example output:
Spear & Jackson Predator Universal Hardpoint Saw - 22" 5.95
Note that for the item, the regular expression is used only to remove extra whitespace, while for the price it is used to capture the number.
html = '''
<h2>Spear & Jackson Predator Universal Hardpoint Saw - 22</h2>
<p><span class="productlist_mostwanted_rrp">
Was: <span class="strikethrough">£12.52</span></span><span class="productlist_mostwanted_save">Save: £6.57(52%)</span></p>
<div class="clear"></div>
<p class="productlist_mostwanted_price">Now: £5.95</p>
'''
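from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, 'html.parser')
# Assumes the <h2> text is wrapped in an <a> link, as on the original page.
desc = soup.h2.a.get_text()
price_str = soup.find('p', {"class": "productlist_mostwanted_price"}).get_text()
# [0-9.]+ grabs the first run of digits and dots, here "5.95".
price = float(re.search(r'[0-9.]+', price_str).group())
print(desc, price)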
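Neither answer covers the two-column CSV export the question mentions, or looping over more than one product. A rough sketch of both follows. The wrapping div with class "product", the <a> inside each <h2>, and the second product are all invented for the example, since the real page structure is not shown in full.

```python
import csv
import io
import re
from bs4 import BeautifulSoup

# Hypothetical listing with two products; see the assumptions noted above.
html = """
<div class="product">
<h2><a href="#">Spear &amp; Jackson Predator Universal Hardpoint Saw - 22"</a></h2>
<p class="productlist_mostwanted_price">Now: £5.95</p>
</div>
<div class="product">
<h2><a href="#">Example Hammer</a></h2>
<p class="productlist_mostwanted_price">Now: £7.50</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for product in soup.find_all("div", class_="product"):
    # Collapse whitespace in the name and pull the number out of the price text.
    name = re.sub(r"\s+", " ", product.h2.get_text()).strip()
    price_text = product.find("p", class_="productlist_mostwanted_price").get_text()
    match = re.search(r"\d+\.\d+", price_text)
    if match:
        rows.append((name, match.group(0)))

# Written to an in-memory buffer here; open("products.csv", "w", newline="")
# could be used instead to write a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["item", "price"])
writer.writerows(rows)
print(buffer.getvalue())
```

The csv module takes care of quoting, which matters here because the first product name contains a double quote.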
