Python BeautifulSoup parsing

I am trying to scrape some content (I am very new to Python) and I have hit a stumbling block. The HTML I am trying to scrape is:
<h2><a href="...">Spear & Jackson Predator Universal Hardpoint Saw - 22"</a></h2>
<p><span class="productlist_mostwanted_rrp">
Was: <span class="strikethrough">£12.52</span></span><span class="productlist_mostwanted_save">Save: £6.57(52%)</span></p>
<div class="clear"></div>
<p class="productlist_mostwanted_price">Now: £5.95</p>
What I am trying to scrape is the link text (Spear & Jackson etc.) and the price (£5.95). I have looked about on Google, in the BeautifulSoup documentation and on this forum, and I managed to extract the "Now: £5.95" using this code:
for node in soup.findAll('p', { "class" : "productlist_grid_price" }):
    print ''.join(node.findAll(text=True))
However the result I am after is just 5.95. I have also had limited success trying to get the link text (Spear & Jackson) using:
soup.h2.a.contents[0]
However of course this returns just the first result.
The ultimate result that I am aiming for is to have the results look like:
Spear & Jackson Predator Universal Hardpoint Saw - 22" 5.95
etc
etc
As I am looking to export this to a CSV, I need to figure out how to put the data into 2 columns. Like I say, I am very new to Python, so I hope this makes sense.
I appreciate any help!
Many thanks

I think what you're looking for is something like this:
from BeautifulSoup import BeautifulSoup
import re

soup = BeautifulSoup(open('prueba.html').read())
item = re.sub(r'\s+', ' ', soup.h2.a.text)      # collapse runs of whitespace in the title
price = soup.find('p', {'class': 'productlist_mostwanted_price'}).text
price = re.search(r'\d+\.\d+', price).group(0)  # capture just the number
print item, price
Example output:
Spear & Jackson Predator Universal Hardpoint Saw - 22" 5.95
Note that for the item, the regular expression is used just to collapse extra whitespace, while for the price it is used to capture the number.

html = '''
<h2><a href="...">Spear & Jackson Predator Universal Hardpoint Saw - 22"</a></h2>
<p><span class="productlist_mostwanted_rrp">
Was: <span class="strikethrough">£12.52</span></span><span class="productlist_mostwanted_save">Save: £6.57(52%)</span></p>
<div class="clear"></div>
<p class="productlist_mostwanted_price">Now: £5.95</p>
'''
from BeautifulSoup import BeautifulSoup
import re
soup = BeautifulSoup(html)
desc = soup.h2.a.getText()
price_str = soup.find('p', {"class": "productlist_mostwanted_price" }).getText()
price = float(re.search(r'[0-9.]+', price_str).group())
print desc, price
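To get the two-column CSV the question ultimately asks for, the extracted pairs can be written out with the standard csv module. A minimal sketch (Python 2, matching the answers above), assuming you collect one (desc, price) tuple per product block:

import csv

rows = [(desc, price)]  # in practice, append one tuple per product block

with open('products.csv', 'wb') as f:  # 'wb' for the Python 2 csv module
    writer = csv.writer(f)
    writer.writerow(['description', 'price'])
    writer.writerows(rows)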

Related

BeautifulSoup: Extracting text from nested tags

Long time lurker, first time poster. I spent some time looking over related questions but I still couldn't seem to figure this out. I think it's easy enough but please forgive me, I'm still a bit of a BeautifulSoup/python n00b.
I have a text file of URLs from a previous web-scraping exercise that I'd like to search through, extracting the text contents of a list item (<li>) based on a given keyword. I want to save a CSV file with the URL in one column and the corresponding contents from the list item in the second column. In this case, I'd like to build a table of albums by who mastered each album, who produced it, and so on.
Given a snippet of html:
from https://www.discogs.com/release/7896531-The-Rolling-Stones-Some-Girls
...
<li>
<span class="entity_1XpR8">Recorded By</span>
" – "
EMI Studios, Paris
</li>
<li>
<span class="entity_1XpR8">Mastered At</span>
" – "
Sterling Sound
</li>
etc etc etc
...
My code so far is something like:
import requests
import pandas as pd
from bs4 import BeautifulSoup

results = []
kw = "Mastered At"
with open("urls.txt") as file:
    for line in file:
        url = line.rstrip()
        source = requests.get(url).text
        soup = BeautifulSoup(source, "html.parser")
        x = soup.find_all('span', string='Mastered At')
        results.append((url, x))
print(results)
df = pd.DataFrame(results)
df.to_csv('mylist1.csv')
After some modifications based on the comments below, I'm still having issues:
As you can see I'm trying to do this within a for loop for each link in a list.
The URL list is a simple text file with one URL per line. Since I'm scraping only one website, the page layout, class names, etc. should be the same, but the list-item contents will change from page to page.
ex URL list:
https://www.discogs.com/release/7896531-The-Rolling-Stones-Some-Girls
https://www.discogs.com/release/3872976-Pink-Floyd-The-Wall
... etc etc etc
updated code snippet:
import requests
import pandas as pd
from bs4 import BeautifulSoup

results = []
with open("urls.txt") as file:
    for line in file:
        url = line.rstrip()
        print(url)
        source = requests.get(url).text
        soup = BeautifulSoup(source, "html.parser")
        for x in [x for x in soup.select('li') if x.select_one('span.spClass').text.strip() == 'Mastered At']:
            results.append((x.select_one('a.linkClass').get('href'), x.select_one('a.linkClass').text.strip(),
                            x.select_one('span.spClass').text.strip()))
df = pd.DataFrame(results, columns=['Url', 'Mastered At', 'Studio'])
print(df)
df.to_csv('studios.csv')
I'm hoping the output in this case is Col 1: the URL from the txt file; Col 2: "Mastered At – Sterling Sound" (or just "Sterling Sound"), repeated for each page in the list, since these items vary from page to page. I will change the keyword to extract different list items accordingly. In the end I'd like one big spreadsheet with the URLs and corresponding items side by side, something like below:
example:
album url | Sterling Sound
album url | Abbey Road
album url | Abbey Road
album url | Sterling Sound
album url | Real World Studios
album url | EMI Studios, Paris
album url | Sterling Sound
etc etc etc
Thanks for your help!
Cheers.
The Beautiful Soup library is best suited for this task.
You can use the following code to extract data:
import requests, lxml
from bs4 import BeautifulSoup

# urls.html would be better
with open("urls.txt") as file:
    src = file.read()

soup = BeautifulSoup(src, 'lxml')

for first, second in zip(soup.select("li span"), soup.select("li a")):
    print(first)
    print(second)
To find the desired elements, you can use the select() bs4 method. This method accepts a CSS selector to search for and returns a list of all matched HTML elements.
In this case, I use the zip() built-in function, which lets you walk two sequences at once in a single loop.
Then you can use the data for your tasks.
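If the goal is the two-column CSV from the question, you can zip the same two selections straight into rows; a minimal sketch, assuming every li pairs exactly one span with one a:

import pandas as pd

# pair each span label with the anchor in the same position
rows = [(span.get_text(strip=True), a.get_text(strip=True))
        for span, a in zip(soup.select("li span"), soup.select("li a"))]
df = pd.DataFrame(rows, columns=['Label', 'Value'])
df.to_csv('pairs.csv', index=False)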
BeautifulSoup can use different parsers for HTML. If you have issues with lxml, you can try others, such as html.parser. The following code creates a dataframe from your data, which can then be saved to CSV or other formats:
from bs4 import BeautifulSoup
import pandas as pd
html = '''
<li>
<span class="spClass">Breakfast</span> " — "
<a class="linkClass" href="/examplepage/Pancakes">Pancakes</a>
</li>
<li>
<span class="spClass">Lunch</span> " — "
<a class="linkClass" href="/examplepage/Sandwiches">Sandwiches</a>
</li>
<li>
<span class="spClass">Dinner</span> " — "
<a class="linkClass" href="/examplepage/Stew">Stew</a>
</li>
'''
soup = BeautifulSoup(html, 'html.parser')
df_list = []
for x in soup.select('li'):
    df_list.append((x.select_one('a.linkClass').get('href'), x.select_one('a.linkClass').text.strip(), x.select_one('span.spClass').text.strip()))
df = pd.DataFrame(df_list, columns=['Url', 'Food', 'Type'])
print(df)  ## you can save the dataframe as csv like so: df.to_csv('foods.csv')
Result:
                       Url        Food       Type
0    /examplepage/Pancakes    Pancakes  Breakfast
1  /examplepage/Sandwiches  Sandwiches      Lunch
2        /examplepage/Stew        Stew     Dinner
EDIT: If you only want to extract specific li tags, as per your comment, you can do:
soup = BeautifulSoup(html, 'html.parser')
df_list = []
for x in [x for x in soup.select('li') if x.select_one('span.spClass').text.strip() == 'Dinner']:
    df_list.append((x.select_one('a.linkClass').get('href'), x.select_one('a.linkClass').text.strip(), x.select_one('span.spClass').text.strip()))
df = pd.DataFrame(df_list, columns=['Url', 'Food', 'Type'])
And this will return:
                 Url  Food    Type
0  /examplepage/Stew  Stew  Dinner
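Note that in the Discogs snippet from the question the studio name is a bare text node after the span rather than an anchor, so the a.linkClass selectors above will find nothing there. A minimal sketch for that shape of markup, assuming the live page matches the question's snippet:

from bs4 import BeautifulSoup

html = '''
<li>
<span class="entity_1XpR8">Mastered At</span>
" – "
Sterling Sound
</li>
'''
soup = BeautifulSoup(html, 'html.parser')
for span in soup.find_all('span', string='Mastered At'):
    # the value is the text of the whole <li> minus the span's label
    raw = span.find_parent('li').get_text(' ', strip=True)
    value = raw.replace('Mastered At', '', 1).strip(' "–')
    print(value)  # Sterling Sound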

BeautifulSoup how to use for loops and extract specific data?

The HTML code below is from a website regarding movie reviews. I want to extract the Stars from the code below, which would be John C. Reilly, Sarah Silverman and Gal Gadot. How could I do this?
Code:
html_doc = """
<html>
<head>
</head>
<body>
<div class="credit_summary_item">
<h4 class="inline">Stars:</h4>
<a href="/name/nm...">John C. Reilly</a>,
<a href="/name/nm...">Sarah Silverman</a>,
<a href="/name/nm...">Gal Gadot</a>
<span class="ghost">|</span>
<a href="fullcredits/...">See full cast & crew</a> »
</div>
</body>
</html>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
My idea
I was going to use for loops to iterate through each div class until I found the one with the text "Stars:", and then extract the names. But I don't know how I would code this, as I am not too familiar with HTML syntax or the module.
You can iterate over all a tags in the credit_summary_item div:
from bs4 import BeautifulSoup as soup
*results, _ = [i.text for i in soup(html_doc, 'html.parser').find('div', {'class':'credit_summary_item'}).find_all('a')]
Output:
['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']
Edit:
_d = [i for i in soup(html_doc, 'html.parser').find_all('div', {'class':'credit_summary_item'}) if 'Stars:' in i.text][0]
*results, _ = [i.text for i in _d.find_all('a')]
Output:
['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']
I will show how to implement this; you will see that you only need to learn BeautifulSoup syntax.
First, we want to use the findAll method for the "div" tag with the attribute "class":
divs = soup.findAll("div", attrs={"class": "credit_summary_item"})
Then, we filter out the divs without "Stars:" in them:
stars = [div for div in divs if "Stars:" in div.h4.text]
If there is only one such div, you can take it out:
star = stars[0]
Then, again, find all the text in the "a" tags:
names = [a.text for a in star.findAll("a")]
You can see that I didn't use any HTML/CSS selector syntax, only soup methods.
I hope it helped.
You can also use a regex:
import re
stars = soup.findAll('a', href=re.compile('/name/nm.+'))
names = [x.text for x in stars]
names
# output: ['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']
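A CSS-selector variant of the same idea; a sketch assuming the html_doc above, where the last anchor in the div is the "See full cast & crew" link and gets sliced off:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
# drop the trailing "See full cast & crew" anchor with [:-1]
names = [a.text for a in soup.select('div.credit_summary_item a')[:-1]]
print(names)  # ['John C. Reilly', 'Sarah Silverman', 'Gal Gadot']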

Further Probing in BeautifulSoup

So I'm pretty new to BeautifulSoup and web scraping in general. I am currently running the code:
attraction_names_full = soup.find_all('td', class_='alt2', align = 'right', height = '28')
Which is returning a list comprising of objects which look like this:
<td align="right" class="alt2" height="28">
A Pirate's Adventure - Treasures of the Seven Seas
<br/>
<span style="font-size: 9px; color: #627DAD; font-style: italic;">
12:00pm to 6:00pm
</span>
</td>
What I am trying to get from this is just the line containing the text, which in this case would be
A Pirate's Adventure - Treasures of the Seven Seas
However, I'm not sure how to go about this, as it doesn't seem to have any tags surrounding just the text.
I have attempted to interact with the elements as strings; however, the object type seems to be:
<class 'bs4.element.Tag'>
which I'm not sure how to manipulate, and I am sure there must be a much more efficient way of achieving this.
Any ideas on how to achieve this? For reference, the webpage I'm looking at is:
url = 'https://www.thedibb.co.uk/forums/wait_times.php?a=MK'
You could extract the <span> element and then get the stripped text as follows:
from bs4 import BeautifulSoup
import requests
html = requests.get('https://www.thedibb.co.uk/forums/wait_times.php?a=MK').content
soup = BeautifulSoup(html, "html.parser")
for attraction in soup.find_all('td', class_='alt2', align='right', height='28'):
    attraction.span.extract()  # remove the <span> holding the times
    print attraction.get_text(strip=True)
Which would give you output starting:
A Pirate's Adventure - Treasures of the Seven Seas
Captain Jack Sparrow's Pirate Tutorial
Jungle Cruise
Meet Characters from Aladdin in Adventureland
Pirates of the Caribbean
Swiss Family Treehouse
import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("https://www.thedibb.co.uk/forums/wait_times.php?a=MK").read()
soup = BeautifulSoup(html, 'html.parser')
listElem = list(soup.find_all('td', class_='alt2', align='right', height='28'))
print(listElem[1].contents[0])
You can use .contents; it works for me, and the output is "Captain Jack Sparrow's Pirate Tutorial".
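An alternative that avoids mutating the tree with extract(); a sketch assuming the same td structure, where the attraction name is the first text node in the cell:

for attraction in soup.find_all('td', class_='alt2', align='right', height='28'):
    # stripped_strings yields the cell's text nodes; the first is the name
    name = next(attraction.stripped_strings, None)
    if name:
        print(name)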

Scrapy or BeautifulSoup to scrape links and text from various websites

I am trying to scrape the links from an inputted URL, but it's only working for one URL (http://www.businessinsider.com). How can it be adapted to scrape from any URL inputted? I am using BeautifulSoup, but is Scrapy better suited for this?
def WebScrape():
    linktoenter = input('Where do you want to scrape from today?: ')
    url = linktoenter
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    if linktoenter in url:
        print('Retrieving your links...')
        links = {}
        n = 0
        link_title = soup.findAll('a', {'class': 'title'})
        n += 1
        links[n] = link_title
        for eachtitle in link_title:
            print(eachtitle['href'] + "," + eachtitle.string)
    else:
        print('Please enter another Website...')
You could make a more generic scraper, searching for all tags and all links within those tags. Once you have the list of all links, you can use a regular expression or similar to find the links that match your desired structure.
import requests
from bs4 import BeautifulSoup
import re

response = requests.get('http://www.businessinsider.com')
soup = BeautifulSoup(response.content, 'html.parser')

# find all tags
tags = soup.find_all()

links = []
# iterate over all tags and collect their href links
for tag in tags:
    for link in tag.find_all(href=True):
        if link['href']:
            links.append(link['href'])

# example: keep only careerbuilder links
careerbuilder_links = [link for link in links
                       if re.search(r'[w]{3}\.careerbuilder\.com', link)]
code:
def WebScrape():
    url = input('Where do you want to scrape from today?: ')
    html = urllib.request.urlopen(url).read()
    soup = bs4.BeautifulSoup(html, "lxml")
    title_tags = soup.findAll('a', {'class': 'title'})
    url_titles = [(tag['href'], tag.text) for tag in title_tags]
    if title_tags:
        print('Retrieving your links...')
        for url_title in url_titles:
            print(*url_title)
out:
Where do you want to scrape from today?: http://www.businessinsider.com
Retrieving your links...
http://www.businessinsider.com/trump-china-drone-navy-2016-12 Trump slams China's capture of a US Navy drone as 'unprecedented' act
http://www.businessinsider.com/trump-thank-you-rally-alabama-2016-12 'This is truly an exciting time to be alive'
http://www.businessinsider.com/how-smartwatch-pioneer-pebble-lost-everything-2016-12 How the hot startup that stole Apple's thunder wound up in Silicon Valley's graveyard
http://www.businessinsider.com/china-will-return-us-navy-underwater-drone-2016-12 Pentagon: China will return US Navy underwater drone seized in South China Sea
http://www.businessinsider.com/what-google-gets-wrong-about-driverless-cars-2016-12 Here's the biggest thing Google got wrong about self-driving cars
http://www.businessinsider.com/sheriff-joe-arpaio-still-wants-to-investigate-obamas-birth-certificate-2016-12 Sheriff Joe Arpaio still wants to investigate Obama's birth certificate
http://www.businessinsider.com/rents-dropping-in-new-york-bubble-pop-2016-12 Rents are finally dropping in New York City, and a bubble might be about to pop
http://www.businessinsider.com/trump-david-friedman-ambassador-israel-2016-12 Trump's ambassador pick could drastically alter 2 of the thorniest issues in the US-Israel relationship
http://www.businessinsider.com/can-hackers-be-caught-trump-election-russia-2016-12 Why Trump's assertion that hackers can't be caught after an attack is wrong
http://www.businessinsider.com/theres-a-striking-commonality-between-trump-and-nixon-2016-12 There's a striking commonality between Trump and Nixon
http://www.businessinsider.com/tesla-year-in-review-2016-12 Tesla's biggest moments of 2016
http://www.businessinsider.com/heres-why-using-uber-to-fill-public-transportation-gaps-is-a-bad-idea-2016-12 Here's why using Uber to fill public transportation gaps is a bad idea
http://www.businessinsider.com/useful-hard-adopt-early-morning-rituals-productive-exercise-2016-12 4 morning rituals that are hard to adopt but could really pay off
http://www.businessinsider.com/most-expensive-champagne-bottles-money-can-buy-2016-12 The 11 most expensive Champagne bottles money can buy
http://www.businessinsider.com/innovations-in-radiology-2016-11 5 innovations in radiology that could impact everything from the Zika virus to dermatology
http://www.businessinsider.com/ge-healthcare-mr-freelium-technology-2016-11 A new technology is being developed using just 1% of the finite resource needed for traditional MRIs
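On the question's second point: for crawling many pages, Scrapy is often the better fit, since it handles request scheduling, retries, and CSV/JSON output feeds for you. A minimal, hypothetical spider for the same a.title links (a sketch; the class name is site-specific, so adjust the selector per site):

import scrapy

class TitleLinksSpider(scrapy.Spider):
    name = 'title_links'
    start_urls = ['http://www.businessinsider.com']

    def parse(self, response):
        # same idea as the BeautifulSoup version: every <a class="title">
        for a in response.css('a.title'):
            yield {
                'href': a.attrib.get('href'),
                'text': a.css('::text').get(),
            }

Run it with scrapy runspider spider.py -o links.csv to get the two columns directly.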

Python: How to extract URL from HTML Page using BeautifulSoup?

I have an HTML page with multiple divs like:
<div class="article-additional-info">
A peculiar situation arose in the Supreme Court on Tuesday when two lawyers claimed to be the representative of one of the six accused in the December 16 gangrape case who has sought shifting of t...
<a class="more" href="http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece">
<span class="arrows">»</span>
</a>
</div>
<div class="article-additional-info">
Power consumers in the city will have to brace for spending more on their monthly bills as all three power distribution companies – the Anil Ambani-owned BRPL and BYPL and the Tatas-owned Tata Powe...
<a class="more" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece">
<a class="commentsCount" href="http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments">
</div>
and I need to get the <a href=...> value for all the divs with class article-additional-info.
I am new to BeautifulSoup, so I need the URLs:
"http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece"
"http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece"
What is the best way to achieve this?
According to your criteria, it returns three URLs (not two) - did you want to filter out the third?
The basic idea is to iterate through the HTML, pull out only those elements in your class, and then iterate through all of the links in each, grabbing the actual URLs:
In [1]: from bs4 import BeautifulSoup
In [2]: html = # your HTML
In [3]: soup = BeautifulSoup(html)
In [4]: for item in soup.find_all(attrs={'class': 'article-additional-info'}):
   ...:     for link in item.find_all('a'):
   ...:         print link.get('href')
   ...:
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments
This limits your search to just those elements with the article-additional-info class, and inside each of them looks for all anchor (a) tags and grabs their corresponding href links.
After working with the documentation, I did it the following way. Thank you all for your answers; I appreciate them.
>>> import urllib2
>>> f = urllib2.urlopen('http://www.thehindu.com/news/cities/delhi/?union=citynews')
>>> soup = BeautifulSoup(f.fp)
>>> for link in soup.select('.article-additional-info'):
...     print link.find('a').attrs['href']
...
http://www.thehindu.com/news/cities/Delhi/airport-metro-express-is-back/article4335059.ece
http://www.thehindu.com/news/cities/Delhi/91-more-illegal-colonies-to-be-regularised/article4335069.ece
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/nurses-women-groups-demand-safety-audit-of-workplaces/article4331470.ece
http://www.thehindu.com/news/cities/Delhi/delhi-bpl-families-to-get-12-subsidised-lpg-cylinders/article4328990.ece
http://www.thehindu.com/news/cities/Delhi/shias-condemn-violence-against-religious-minorities/article4328276.ece
http://www.thehindu.com/news/cities/Delhi/new-archbishop-of-delhi-takes-over/article4328284.ece
http://www.thehindu.com/news/cities/Delhi/delhi-metro-to-construct-subway-without-disrupting-traffic/article4328290.ece
http://www.thehindu.com/life-and-style/Food/going-for-the-kill-in-patparganj/article4323210.ece
http://www.thehindu.com/news/cities/Delhi/fire-at-janpath-bhavan/article4335068.ece
http://www.thehindu.com/news/cities/Delhi/fiveyearold-girl-killed-as-school-van-overturns/article4335065.ece
http://www.thehindu.com/news/cities/Delhi/real-life-stories-of-real-women/article4331483.ece
http://www.thehindu.com/news/cities/Delhi/women-councillors-allege-harassment-by-male-councillors-of-rival-parties/article4331471.ece
http://www.thehindu.com/news/cities/Delhi/airport-metro-resumes-today/article4331467.ece
http://www.thehindu.com/news/national/hearing-today-on-plea-to-shift-trial/article4328415.ece
http://www.thehindu.com/news/cities/Delhi/protestors-demand-change-in-attitude-of-men-towards-women/article4328277.ece
http://www.thehindu.com/news/cities/Delhi/bjp-promises-5-lakh-houses-for-poor-on-interestfree-loans/article4328280.ece
http://www.thehindu.com/life-and-style/metroplus/papad-bidi-and-a-dacoit/article4323219.ece
http://www.thehindu.com/life-and-style/Food/gharana-of-food-not-just-music/article4323212.ece
>>>
from bs4 import BeautifulSoup as BS
html = # Your HTML
soup = BS(html)
for text in soup.find_all('div', class_='article-additional-info'):
    for links in text.find_all('a'):
        print links.get('href')
Which prints:
http://www.thehindu.com/news/national/gangrape-case-two-lawyers-claim-to-be-engaged-by-accused/article4332680.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece
http://www.thehindu.com/news/cities/Delhi/power-discoms-demand-yet-another-hike-in-charges/article4331482.ece#comments
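If you only want the two article URLs and not the #comments link, note that in the snippet both desired anchors carry class="more", so you can select on that; a minimal sketch, assuming soup is parsed as above:

# both wanted links carry class="more"; the comments link does not
for link in soup.select('div.article-additional-info a.more'):
    print link.get('href')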
