<div id="browse_in_widget">
<span id="browse_in_breadcrumb" style="width: 583px;">
<div class="seo_itemscope" itemtype="http://data-vocabulary.org/Breadcrumb" itemscope="">
<a itemprop="url" href="/search/"> Arabian Area</a>
<span class="seo_itemprop-title" itemprop="title">Arabian Area</span>
>
</div>
<div class="seo_itemscope" itemtype="http://data-vocabulary.org/Breadcrumb" itemscope="">
<a itemprop="url" href="/property-for-rent/home/"> Phase 2 </a>
<span class="seo_itemprop-title" itemprop="title">Phase 2 </span>
>
</div>
<div class="seo_itemscope" itemtype="http://data-vocabulary.org/Breadcrumb" itemscope="">
<a itemprop="url" href="/property-for-rent/residential/"> Residential Units for Rent </a>
<span class="seo_itemprop-title" itemprop="title">Residential Units for Rent</span>
>
</div>
<div class="seo_itemscope" itemtype="http://data-vocabulary.org/Breadcrumb" itemscope="">
<a itemprop="url" href="/property-for-rent/residential/apartmentflat/"> Apartment/Flat for Rent </a>
<span class="seo_itemprop-title" itemprop="title">Apartment/Flat for Rent</span>
>
</div>
<strong class="seo_itemprop-title" itemprop="title">Details</strong>
</span>
</div>
I want to get
['Arabian Area', 'Phase 2', 'Residential Units for Rent','Apartment/Flat for Rent']
I am trying to extract it with the following BeautifulSoup 4 (Python) code:
try:
    Type = [str(Area.text) for Area in soup.find_all("span", {"class" : "seo_itemscope"})]
    Area=' , '.join(Area)
    print Area
except StandardError as e:
    Area="Error was {0}".format(e)
    print Area
All I want is to get the desired output as a list, but there seems to be some problem: nothing is printed at all. What can be the problem?
Thank you!
The first problem is that you are looking for span elements with seo_itemscope class which don't exist. Use seo_itemprop-title if you are looking for the titles:
Type = [item.get_text() for item in soup.find_all("span", {"class": "seo_itemprop-title"})]
The other problem is here:
Area=' , '.join(Area)
You meant to join items of the Type list instead:
Area = ' , '.join(Type)
And it is not a good idea to catch StandardError - it is too broad an exception, practically a bare except clause. You should catch more specific exceptions instead; see:
Should I always specify an exception type in `except` statements?
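Putting both fixes together, a minimal sketch run against just the title spans from the markup above:

```python
from bs4 import BeautifulSoup

html = '''
<span class="seo_itemprop-title" itemprop="title">Arabian Area</span>
<span class="seo_itemprop-title" itemprop="title">Phase 2 </span>
<span class="seo_itemprop-title" itemprop="title">Residential Units for Rent</span>
<span class="seo_itemprop-title" itemprop="title">Apartment/Flat for Rent</span>
'''

soup = BeautifulSoup(html, 'html.parser')
# collect the title spans, stripping the stray trailing spaces seen in the markup
titles = [span.get_text().strip()
          for span in soup.find_all('span', {'class': 'seo_itemprop-title'})]
area = ' , '.join(titles)
print(titles)
print(area)
```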
I was trying to extract the string '£150,000' from this HTML by locating the label 'Purchase Price' inside the block, since the same class is used more than once:
<div class="row mb-sm-1 property-header-row">
<div class="prop-capital-fields property-header-col col-6"><h3>£150,000</h3>
<p class="label-paragraph">
Purchase Price
</p>
</div>
<div class="prop-capital-fields property-header-col col-6"><h3>£180,000</h3>
<p class="label-paragraph">
Market Value
</p>
</div>
<div class="prop-capital-fields property-header-col col-6"><h3>£1,185</h3>
<p class="label-paragraph">
Potential Cashflow PCM
</p>
</div>
</div>
So I wrote the following code
property_ = soup.find(class_="properties-content-body col-xs-12 col-sm-12 col-md-7")
for a in property_.find_all('div', attrs={'class': 'prop-capital-fields property-header-col col-6'}, text="Purchase Price"):
    purchase_price_list.append(a)
print(purchase_price_list)
but all I get is a blank list
I've tried many other things but I'm pretty sure I just don't know the correct way to do it.
Any help is appreciated.
I've found the answer:
for a in property_.find_all('div', attrs={'class': 'prop-capital-fields property-header-col col-6'}):
    b = a.find('p').text.replace("\n", "").strip()
    c = a.find('h3').text.strip()
    if b == 'Purchase Price':
        purchase_price_list.append(c)
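Run against the snippet from the question, the accepted approach can be sketched end to end like this:

```python
from bs4 import BeautifulSoup

html = '''
<div class="row mb-sm-1 property-header-row">
<div class="prop-capital-fields property-header-col col-6"><h3>£150,000</h3>
<p class="label-paragraph">Purchase Price</p>
</div>
<div class="prop-capital-fields property-header-col col-6"><h3>£180,000</h3>
<p class="label-paragraph">Market Value</p>
</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
purchase_price_list = []
for a in soup.find_all('div', attrs={'class': 'prop-capital-fields property-header-col col-6'}):
    label = a.find('p').get_text().strip()   # e.g. "Purchase Price"
    value = a.find('h3').get_text().strip()  # e.g. "£150,000"
    if label == 'Purchase Price':
        purchase_price_list.append(value)
print(purchase_price_list)
```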
I am trying to scrape a Spotify playlist webpage to pull out artist and song name data. Here is my Python code:
#! /usr/bin/python
from lxml import html
import requests
playlistPage = requests.get('https://open.spotify.com/playlist/0csaTlUWTfiyXscv4qKDGE')
print("\n\nprinting variable playListPage: " + str(playlistPage))
tree = html.fromstring(playlistPage.content)
print("printing variable tree: " + str(tree))
artistList = tree.xpath("//span/a[@class='tracklist-row__artist-name-link']/text()")
print("printing variable artistList: " + str(artistList) + "\n\n")
Right now the final print statement is printing out an empty list.
Here is some example HTML from the page I'm trying to scrape. Ideally my code would pull out the string "M83" - not sure how much HTML is relevant, so I'm pasting what I believe is necessary:
<div class="react-contextmenu-wrapper">
<div draggable="true">
<li class="tracklist-row" role="button" tabindex="0" data-testid="tracklist-row">
<div class="tracklist-col position-outer">
<div class="tracklist-play-pause tracklist-top-align">
<svg class="icon-play" viewBox="0 0 85 100">
<path fill="currentColor" d="M81 44.6c5 3 5 7.8 0 10.8L9 98.7c-5 3-9 .7-9-5V6.3c0-5.7 4-8 9-5l72 43.3z">
<title>
PLAY</title>
</path>
</svg>
</div>
<div class="position tracklist-top-align">
<span class="spoticon-track-16">
</span>
</div>
</div>
<div class="tracklist-col name">
<div class="track-name-wrapper tracklist-top-align">
<div class="tracklist-name ellipsis-one-line" dir="auto">
Intro</div>
<div class="second-line">
<span class="TrackListRow__artists ellipsis-one-line" dir="auto">
<span class="react-contextmenu-wrapper">
<span draggable="true">
<a tabindex="-1" class="tracklist-row__artist-name-link" href="/artist/63MQldklfxkjYDoUE4Tppz">
M83</a>
</span>
</span>
</span>
<span class="second-line-separator" aria-label="in album">
•</span>
<span class="TrackListRow__album ellipsis-one-line" dir="auto">
<span class="react-contextmenu-wrapper">
<span draggable="true">
<a tabindex="-1" class="tracklist-row__album-name-link" href="/album/6R0ynY7RF20ofs9GJR5TXR">
Hurry Up, We're Dreaming</a>
</span>
</span>
</span>
</div>
</div>
</div>
<div class="tracklist-col more">
<div class="tracklist-top-align">
<div class="react-contextmenu-wrapper">
<button class="_2221af4e93029bedeab751d04fab4b8b-scss c74a35c3aba27d72ee478f390f5d8c16-scss" type="button">
<div class="spoticon-ellipsis-16">
</div>
</button>
</div>
</div>
</div>
<div class="tracklist-col tracklist-col-duration">
<div class="tracklist-duration tracklist-top-align">
<span>
5:22</span>
</div>
</div>
</li>
</div>
</div>
A solution using Beautiful Soup:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://open.spotify.com/playlist/0csaTlUWTfiyXscv4qKDGE')
soup = bs(page.content, 'lxml')
tracklist_container = soup.find("div", {"class": "tracklist-container"})
track_artists_container = tracklist_container.findAll("span", {"class": "artists-albums"})
artists = []
for ta in track_artists_container:
    artists.append(ta.find("span").text)
print(artists[0])
prints
M83
This solution gets all the artists on the page so you could print out the list artists and get:
['M83',
'Charles Bradley',
'Bon Iver',
...
'Death Cab for Cutie',
'Destroyer']
And you can extend this to track names and albums quite easily by changing the class name in the findAll(...) call.
Nice answer provided by @eNc. An lxml solution:
from lxml import html
import requests
playlistPage = requests.get('https://open.spotify.com/playlist/0csaTlUWTfiyXscv4qKDGE')
tree = html.fromstring(playlistPage.content)
artistList = tree.xpath("//span[@class='artists-albums']/a[1]/span/text()")
print(artistList)
Output:
['M83', 'Charles Bradley', 'Bon Iver', 'The Middle East', 'The Antlers', 'Handsome Furs', 'Frank Turner', 'Frank Turner', 'Amy Winehouse', 'Black Lips', 'M83', 'Florence + The Machine', 'Childish Gambino', 'DJ Khaled', 'Kendrick Lamar', 'Future Islands', 'Future Islands', 'JAY-Z', 'Blood Orange', 'Cut Copy', 'Rihanna', 'Tedeschi Trucks Band', 'Bill Callahan', 'St. Vincent', 'Adele', 'Beirut', 'Childish Gambino', 'David Guetta', 'Death Cab for Cutie', 'Destroyer']
Since you can't get all the results in one shot, maybe you should switch to Selenium.
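For completeness: against the static HTML sample in the question itself, the same idea works with the class names shown there (a sketch; the live page serves different markup over time, which is why the answers above use other class names):

```python
from bs4 import BeautifulSoup

html = '''
<span class="TrackListRow__artists ellipsis-one-line" dir="auto">
<a tabindex="-1" class="tracklist-row__artist-name-link"
   href="/artist/63MQldklfxkjYDoUE4Tppz">M83</a>
</span>
'''

soup = BeautifulSoup(html, 'html.parser')
# match the artist links by their class and collect the stripped text
artists = [a.get_text().strip()
           for a in soup.find_all('a', class_='tracklist-row__artist-name-link')]
print(artists)
```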
I am trying to extract "brand-logo", "product-name", "price" and "best-price " from the following HTML:
<div class="container">
<div class="catalog-wrapper">
<div class="slideout-filters"></div>
<section class="catalog-top-banner"></section>
<section class="search-results">
<section class="catalog">
<div class="row">
<div class="col-xs-12 col-md-4 col-lg-3">
<div class="col-xs-12 col-md-8 col-lg-9">
<div class="catalog-container">
<a class="catalog-product catalog-item ">
<div class="product-image "></div>
<div class="product-description">
<div>
<div class="brand-logo">
<span>PACO RABANNE</span>
</div>
<span class="product-name">
PACO RABANNE PERFUME MUJER 30 ML
</span>
<span class="price">Normal: S/ 219</span>
<span class="best-price ">Internet: S/ 209</span>
"brand-logo" and "product-name" are done, but I cannot read "price" and "best-price ".
I tried it this way:
box_3 = soup.find('div', 'col-xs-12 col-md-8 col-lg-9')
for div in box_3.find_all('div', 'product-description'):
    d = {}
    d["Marca"] = div.find_all("div", {"class": "brand-logo"})[0].getText()
    d["Producto"] = div.find_all("span", {"class": "product-name"})[0].getText()
    d["Precio"] = div.find_all('span', class_='price')
    d["Oferta"] = div.find_all('span', class_='best-price ')
    l.append(d)
l
out:
{'Marca': 'PACO RABANNE',
'Oferta': [],
'Precio': [<span class="price">Normal: S/ 219</span>],
'Producto': 'PACO RABANNE PERFUME MUJER 30 ML'}
can anyone help me?
You can find the "product-description" div and then iterate over the desired div classes:
from bs4 import BeautifulSoup as soup
import re
_to_find = ['brand-logo', 'product-name', 'price', 'best-price']
s = soup(content, 'html.parser').find('div', {'class':'product-description'})
final_results = [(lambda x: s.find('span', {'class':i}).text if not x else x.text)(s.find('div', {'class':i})) \
for i in _to_find]
filtered = [re.sub('^[\n\s]+|[\n\s]+$', '', i) for i in final_results]
Output:
['PACO RABANNE', 'PACO RABANNE PERFUME MUJER 30 ML', 'Normal: S/ 219', 'Internet: S/ 209']
Unfortunately, without the actual website I'm unable to verify the solution. Maybe you should extract the data from the "not working" part the same way as the working part (this is a lucky guess - without the website, or at least the exact HTML that bs4 will parse, I really can't test it):
d["Precio"] = div.find_all('span',{"class","price"})[0].getText()
d["Oferta"] = div.find_all('span',{"class","best-price"})[0].getText()
It might be a good idea to write a helper method/function that fetches the chosen attribute and handles potential errors.
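Such a helper might look like the sketch below (get_field is an illustrative name, and the HTML is reduced from the question):

```python
from bs4 import BeautifulSoup

def get_field(container, tag, class_name):
    """Return the stripped text of the first matching element, or '' if absent."""
    element = container.find(tag, class_=class_name)
    return element.get_text().strip() if element else ''

html = '''
<div class="product-description">
<div><div class="brand-logo"><span>PACO RABANNE</span></div>
<span class="product-name">PACO RABANNE PERFUME MUJER 30 ML</span>
<span class="price">Normal: S/ 219</span>
<span class="best-price ">Internet: S/ 209</span></div>
</div>
'''

div = BeautifulSoup(html, 'html.parser').find('div', class_='product-description')
d = {
    'Marca': get_field(div, 'div', 'brand-logo'),
    'Producto': get_field(div, 'span', 'product-name'),
    'Precio': get_field(div, 'span', 'price'),
    # matching by class token sidesteps the trailing space in "best-price "
    'Oferta': get_field(div, 'span', 'best-price'),
}
print(d)
```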
Please have a look at the following HTML code:
<section class = "products">
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="5999"> 5,999</span> </span>
<span class="price -old ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="9999"> 9,999</span> </span>
</span>
</section>
In the products section there are 40 such blocks, each containing the prices for one item. Not all products have an old price, but every product has a current price. When I try to access the item prices I also get the old ones, so I end up with 69 prices in total instead of 40. I am missing something, but since I am new to this field I couldn't figure it out. Could someone please help? Thanks.
You can use a CSS selector to match the exact class name. For example, here, you can use span[class="price "] as the selector, and it won't match the old prices.
html = '''
<section class = "products">
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="5999"> 5,999</span>
</span>
<span class="price -old ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="9999"> 9,999</span>
</span>
</span>
</section>'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for price in soup.select('span[class="price "]'):
    print(price.get_text(' ', strip=True))
Output:
Rs. 5,999
Or, you could also use a custom function to match the class name.
for price in soup.find_all('span', class_=lambda c: c == 'price '):
    print(price.get_text(' ', strip=True))
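Another option, sketched against the same snippet: since each price also carries its value in a data-price attribute, you can filter on the class token list (which does not depend on the trailing-space detail) and read that attribute directly:

```python
from bs4 import BeautifulSoup

html = '''
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="5999"> 5,999</span>
</span>
<span class="price -old ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="9999"> 9,999</span>
</span>
</span>
'''

soup = BeautifulSoup(html, 'html.parser')
current_prices = []
for box in soup.find_all('span', class_='price'):
    if '-old' in box.get('class', []):
        continue  # skip the old-price block, marked by the extra "-old" class
    current_prices.append(box.select_one('span[data-price]')['data-price'])
print(current_prices)
```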
I am using python and beautifulsoup4 to extract some address information.
More specifically, I require assistance when retrieving non-US based zip codes.
Consider the following html data of a US based company: (already a soup object)
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">999 State St Ste 100</span><br/>
<span class="locality">Salt Lake City,</span>
<span class="region">UT</span>
<span class="zip">84114-0002,</span>
<br/><span class="country-name">United States</span>
</p>
<p>
<span class="tel">
<strong class="type">Phone: </strong>+1-000-000-000
</span><br/>
</p>
<p class="companyURL"><a class="url ext" href="http://www.website.com" target="_blank">http://www.website.com</a></p>
</div>
</ul>
</div>
I can extract the zipcode (84114-0002) by using the following piece of python code:
class CompanyDescription:
    def __init__(self, page):
        self.data = page.find('div', attrs={'id': 'companyDescription'})

    def address(self):
        # TODO: Also retrieve the Zipcode for UK and German based addresses - tricky!
        address = {'street-address': '', 'locality': '', 'region': '', 'zip': '', 'country-name': ''}
        for key in address:
            try:
                adr = self.data.find('p', attrs={'id': 'adr'})
                if adr.find('span', attrs={'class': key}) is None:
                    address[key] = ''
                else:
                    address[key] = adr.find('span', attrs={'class': key}).text.split(',')[0]
                # Attempting to grab another zip code value
                if address['zip'] == '':
                    pass
            except:
                # We should return a dictionary with "" as key adr
                return address
        return address
You can see that I need some counsel with the line if address['zip'] == '':
These two soup object examples are giving me trouble. In the below I would like to retrieve EC4N 4SA
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
EC4N 4SA
<span class="region">London</span>
<br/><span class="country-name">England</span>
</p>
<p>
</p>
<p class="companyURL"><a class="url ext" href="http://www.website.com.com" target="_blank">http://www.website.com.com</a></p>
</div>
<p><strong>Line of Business</strong> <br/>Management services, nsk</p>
</div>
as well as below, where I am interested in getting 71364
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">Alfred-Kärcher-Str. 100</span><br/>
71364
<span class="locality">Winnenden</span>
<span class="region">Baden-Württemberg</span>
<br/><span class="country-name">Germany</span>
</p>
<p>
<span class="tel">
<strong class="type">Phone: </strong>+00-1234567
</span><br/>
<span class="tel"><strong class="type">Fax: </strong>+00-1234567</span>
</p>
</div>
</div>
Now, I am running this program over approximately 68,000 accounts, of which 28,000 are non-US based. I have only pulled out two examples, and I know the current method is not bulletproof. There may be other address formats where this script does not work as expected, but I believe that figuring out the UK and German based accounts will help tremendously.
Thanks in advance
Because the postcode is bare text directly inside <p>, not wrapped in any tag, you can use
find_all(text=True, recursive=False)
to get only the text nodes that are direct children of the element, skipping text inside nested tags such as <span>. This gives a list with your text plus some \n characters and spaces, so you can use join() to build one string and strip() to remove the surrounding whitespace.
data = '''<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
EC4N 4SA
<span class="region">London</span>
<br/><span class="country-name">England</span>
</p>'''
from bs4 import BeautifulSoup as BS
soup = BS(data, 'html.parser').find('p')
print(''.join(soup.find_all(text=True, recursive=False)).strip())
result: EC4N 4SA
The same works with the second HTML:
data = '''<p id="adr">
<span class="street-address">Alfred-Kärcher-Str. 100</span><br/>
71364
<span class="locality">Winnenden</span>
<span class="region">Baden-Württemberg</span>
<br/><span class="country-name">Germany</span>
</p>'''
from bs4 import BeautifulSoup as BS
soup = BS(data, 'html.parser').find('p')
print(''.join(soup.find_all(text=True, recursive=False)).strip())
result: 71364
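To fold this back into the address() method from the question, the empty-'zip' fallback could collect the stray text the same way (a sketch, tested only against the two snippets shown; real pages may put other bare text in the <p>, and bare_text is an illustrative name):

```python
from bs4 import BeautifulSoup

def bare_text(element):
    """Text that sits directly inside the element, outside any child tag."""
    return ''.join(element.find_all(text=True, recursive=False)).strip()

uk = '''<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
EC4N 4SA
<span class="region">London</span>
<br/><span class="country-name">England</span>
</p>'''

adr = BeautifulSoup(uk, 'html.parser').find('p', attrs={'id': 'adr'})
zip_code = bare_text(adr)
print(zip_code)
```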