I am trying to update product prices by scraping them from a website. However, I have run into some unusual HTML formatting that is giving me trouble. I am trying to return the price without the surrounding whitespace, but my code currently brings in all the spaces.
<p class='product__price'>
<span class='visuallyhidden'>Regular price</span>
"
£9.99
"
</p>
I am trying the following:
soup = BeautifulSoup(web_page, "html.parser")

for product in soup.find_all('div', class_="product-wrapper"):
    # Get product name
    product_title = product.find('p', class_='h4 product__title').text
    # Get product price
    product_price = product.find('p', class_='product__price').text
    product_price.strip()
Unfortunately, using the .strip() method does not work: the script returns the prices with a lot of whitespace and the text "Regular price".
Any ideas on how I can get exactly "£9.99"?
The reason this does not work is that the p element contains two children:
A span element
A text node
When you call .text on the parent p element, you get the text of every descendant, so "Regular price" from the span is included. On top of that, .strip() with no arguments only removes whitespace from the ends of the string, so the quotes and the newlines between them survive. (Also note that strip() returns a new string; product_price.strip() on its own line discards the result.)
To solve the problem you must first isolate the text node, which you can do by iterating over the p element's .children and taking the last entry.
Finally, you can tell .strip() which characters to remove.
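For example, here is a quick sketch of .strip() with an explicit character set; the sample string below mimics the text node from the markup above:
raw = '\n"\n£9.99\n"\n'
# strip() removes any of the listed characters, but only from both ends
print(raw.strip(' \n"'))  # £9.99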
So, assuming the structure inside the p element is always like this, we can do the following:
from bs4 import BeautifulSoup

data = """
<div>
<p class='product__price'>
<span class='visuallyhidden'>Regular price</span>
"
£9.99
"
</p>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

for product in soup.find_all('div'):
    # Get product price
    product_price = product.find('p', class_='product__price')
    # The last child of the p element is the text node we want
    raw_data = list(product_price.children)[-1]
    # Remove spaces, newlines and quotes
    cleaned = raw_data.strip(' \n"')
    print(repr(cleaned))
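Running this prints '£9.99'.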
You can use .contents, take the last element, and then split the string on the double quote:
from bs4 import BeautifulSoup

data = '''<p class='product__price'>
<span class='visuallyhidden'>Regular price</span>
"
£9.99
"
</p>'''

soup = BeautifulSoup(data, 'html.parser')
items = soup.select_one('.product__price').contents
print(items[-1].split('"')[1].strip())
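Note that this relies on the literal double quotes being present in the markup; if a page omits them, split('"') returns a single-element list and the [1] index raises an IndexError.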
You should try this:
product_price = product_price.strip().replace(" ","")
An alternative approach: how about regex?
from bs4 import BeautifulSoup
import re

html = """<div><p class='product__price'>
<span class='visuallyhidden'>Regular price</span>
"
£9.99
"
</p></div>"""

soup = BeautifulSoup(html, "html.parser")

for product in soup.find_all('div'):
    # Get product price
    product_price = product.find('p', class_='product__price').text
    # Regex (a raw string avoids invalid escape warnings)
    price = re.search(r"(£\d+\.?\d*)", product_price)
    # Print only when there is a match
    if price:
        print(price[0])
Long time lurker, first time poster. I spent some time looking over related questions but I still couldn't seem to figure this out. I think it's easy enough but please forgive me, I'm still a bit of a BeautifulSoup/python n00b.
I have a text file of URLs I saved from a previous web-scraping exercise that I'd like to search through, extracting the text contents of a list item (<li>) based on a given keyword. I want to save a csv file with the URL in one column and the corresponding contents of the list item in the second column. In this case, I'd like to build a table of albums by who mastered the album, produced the album, etc.
Given a snippet of html:
from https://www.discogs.com/release/7896531-The-Rolling-Stones-Some-Girls
...
<li>
    <span class="entity_1XpR8">Recorded By</span>
    " – "
    EMI Studios, Paris
</li>
<li>
    <span class="entity_1XpR8">Mastered At</span>
    " – "
    Sterling Sound
</li>
etc etc etc
...
My code so far is something like:
import requests
import pandas as pd
from bs4 import BeautifulSoup

results = []
kw = "Mastered At"

with open("urls.txt") as file:
    for line in file:
        url = line.rstrip()
        source = requests.get(url).text
        soup = BeautifulSoup(source, "html.parser")
        x = soup.find_all('span', string='Mastered At')
        results.append((url, x))
        print(results)

df = pd.DataFrame(results)
df.to_csv('mylist1.csv')
With some modifications based on comments below, still having issues:
As you can see I'm trying to do this within a for loop for each link in a list.
The URL list is a simple text file with each URL on a separate line. Since I'm scraping only one website, the sources, class names, etc. should be the same, but the list item contents will change from page to page.
ex URL list:
https://www.discogs.com/release/7896531-The-Rolling-Stones-Some-Girls
https://www.discogs.com/release/3872976-Pink-Floyd-The-Wall
... etc etc etc
updated code snippet:
import requests
import pandas as pd
from bs4 import BeautifulSoup

results = []

with open("urls.txt") as file:
    for line in file:
        url = line.rstrip()
        print(url)
        source = requests.get(url).text
        soup = BeautifulSoup(source, "html.parser")
        for x in [x for x in soup.select('li') if x.select_one('span.spClass').text.strip() == 'Mastered At']:
            results.append((x.select_one('a.linkClass').get('href'), x.select_one('a.linkClass').text.strip(),
                            x.select_one('span.spClass').text.strip()))

df = pd.DataFrame(results, columns=['Url', 'Mastered At', 'Studio'])
print(df)
df.to_csv('studios.csv')
I'm hoping the output in this case is Col 1: (url from txt file); Col 2: "Mastered At – Sterling Sound" (or just "Sterling Sound"), for each page in the list, since these items vary from page to page. I will change the keyword to extract different list items accordingly. In the end I'd like one big spreadsheet with the URL and the corresponding item side by side, something like below:
example:
album url | Sterling Sound
album url | Abbey Road
album url | Abbey Road
album url | Sterling Sound
album url | Real World Studios
album url | EMI Studios, Paris
album url | Sterling Sound
etc etc etc
Thanks for your help!
Cheers.
The Beautiful Soup library is best suited for this task.
You can use the following code to extract data:
import lxml  # parser backend used by BeautifulSoup below
from bs4 import BeautifulSoup

# urls.html would be a better name, since the file contains HTML
with open("urls.txt") as file:
    src = file.read()

soup = BeautifulSoup(src, 'lxml')

for first, second in zip(soup.select("li span"), soup.select("li a")):
    print(first)
    print(second)
To find the desired elements, you can use the select() bs4 method. It accepts a CSS selector and returns a list of all matching HTML elements.
In this case I also use the zip() built-in function, which lets you iterate over two sequences at once in a single loop.
Then you can use the data for your tasks.
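For illustration, here is a minimal sketch of how zip() pairs up two lists (the label and studio values are invented for the example):
labels = ["Recorded By", "Mastered At"]
studios = ["EMI Studios, Paris", "Sterling Sound"]
# zip() yields (labels[0], studios[0]), (labels[1], studios[1]), ...
for label, studio in zip(labels, studios):
    print(label, "-", studio)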
BeautifulSoup can use different parsers for HTML. If you have issues with lxml, you can try others, such as html.parser. The following code builds a dataframe from your data, which can then be saved to csv or other formats:
from bs4 import BeautifulSoup
import pandas as pd

html = '''
<li>
<span class="spClass">Breakfast</span> " — "
<a class="linkClass" href="/examplepage/Pancakes">Pancakes</a>
</li>
<li>
<span class="spClass">Lunch</span> " — "
<a class="linkClass" href="/examplepage/Sandwiches">Sandwiches</a>
</li>
<li>
<span class="spClass">Dinner</span> " — "
<a class="linkClass" href="/examplepage/Stew">Stew</a>
</li>
'''

soup = BeautifulSoup(html, 'html.parser')

df_list = []
for x in soup.select('li'):
    df_list.append((x.select_one('a.linkClass').get('href'),
                    x.select_one('a.linkClass').text.strip(),
                    x.select_one('span.spClass').text.strip()))

df = pd.DataFrame(df_list, columns=['Url', 'Food', 'Type'])
print(df)  # you can save the dataframe as csv like so: df.to_csv('foods.csv')
Result:
                       Url        Food       Type
0    /examplepage/Pancakes    Pancakes  Breakfast
1  /examplepage/Sandwiches  Sandwiches      Lunch
2        /examplepage/Stew        Stew     Dinner
EDIT: If you only want to extract specific li tags, as per your comment, you can do:
soup = BeautifulSoup(html, 'html.parser')

df_list = []
for x in [x for x in soup.select('li') if x.select_one('span.spClass').text.strip() == 'Dinner']:
    df_list.append((x.select_one('a.linkClass').get('href'),
                    x.select_one('a.linkClass').text.strip(),
                    x.select_one('span.spClass').text.strip()))

df = pd.DataFrame(df_list, columns=['Url', 'Food', 'Type'])
And this will return:
                 Url  Food    Type
0  /examplepage/Stew  Stew  Dinner
This is the HTML:
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>
I want to extract the text 92, convert it to an integer, and print it in Python 2. How can I do that?
Code:
i = soup.find('div', id='NhsjLK')
print "Followers :", i.find('span', id='list_count').text
I'd not go with getting it by the class directly, since I think "list_count" is too broad a class value and might be used for other things on the page.
There are definitely several options judging by this HTML snippet alone, but one of the nicest, from my point of view, is to use the "Followers" text/label and get its next sibling:
from bs4 import BeautifulSoup

data = """
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>"""

soup = BeautifulSoup(data, "html.parser")

# strip the text node before matching, since it starts with a newline
count = soup.find(text=lambda text: text and text.strip().startswith('Followers')).next_sibling.get_text()
count = int(count)
print(count)
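With the snippet above this prints 92.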
Or, another very concise approach would be to use a partial match (the *= part below) on the href value of a parent a element, assuming the followers count sits inside such a link on the real page:
count = int(soup.select_one("a[href*=followers] .list_count").get_text())
Or, you might check the class value of the parent li element:
count = int(soup.select_one("li.FollowersNavItem .list_count").get_text())
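Note that select_one() returns None when nothing matches, so on a real page you may want to check the result before calling get_text() on it.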
I am trying to scrape from the <span class= ''>. The code looks like this on the pages I am scraping:
<span class="catnum">Disc Number</span>
"1"
<br>
<span class="catnum">Track Number</span>
"1"
<br>
<span class="catnum">Duration</span>
"5:28"
<br>
What I need to get are those numbers after the </span> tag. I should also mention this is part of a larger piece of code that scrapes 1200 sites, so it will have to loop over 1200 pages where the numbers in the quotation marks change from page to page.
I tried this code as a test on one page:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("Smith.html"), "html.parser")

for tag in soup.findAll('span'):
    if tag.has_key('class'):
        if tag['class'] == 'catnum':
            print tag.string
I know that will print ALL the 'span class' tags and not just the three I want, but I thought I would still test it to see if it worked and I got this error:
/Library/Python/2.7/site-packages/bs4/element.py:1527: UserWarning: has_key is deprecated. Use has_attr("class") instead.
  key))
As the error message says, you should use tag.has_attr("class") in place of the deprecated tag.has_key("class") method.
Hope it helps.
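As a sketch, the corrected loop could look like this (note that tag['class'] in bs4 is a list of class names, so it should be tested with in rather than compared with ==):
for tag in soup.findAll('span'):
    # has_attr() replaces the deprecated has_key()
    if tag.has_attr('class') and 'catnum' in tag['class']:
        print tag.string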
You can constrain your search with the attribute {'class': 'catnum'} and the text inside, text=re.compile('Disc Number'). Then use .next_sibling to get the text that follows:
from bs4 import BeautifulSoup
import re
s = '''
<span class = "catnum"> Disc Number </span>
"1"
<br/>
<span class = "catnum"> Track Number </span>
"1"
<br/>
<span class = "catnum"> Duration </span>
"5:28"
<br/>'''
soup = BeautifulSoup(s, 'html.parser')
span = soup.find('span', {'class': 'catnum'}, text=re.compile(r'Disc Number'))
print span.next_sibling
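This prints the text node that follows the matching span, i.e. "1" along with its surrounding newlines, which you can remove with .strip().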
My HTML:
<div id="title">
<h2>
My title <span class="subtitle">My Subtitle</span></h2></div>
If I use this code:
title = soup.find('div', id="title").h2.text
print title
>> My title My Subtitle
It matches everything. I want to match My title and My Subtitle as 2 different objects:
print title
>> My title
print subtitle
>> My subtitle
Any help?
You can get the subtitle and its preceding sibling separately:
title = soup.find('div', id="title").h2
subtitle = title.find(class_="subtitle")
print(subtitle.previous_sibling.strip(), subtitle.get_text())
Or, you can locate the text node in a non-recursive mode:
title = soup.find('div', id="title").h2
print(title.find(text=True, recursive=False).strip(),
title.find(class_="subtitle").get_text(strip=True))
Both print:
(u'My title', u'My Subtitle')
One way to do it without using the class attribute is:
h2 = soup.find('div', id="title").h2
subtitle = h2.span.text
title = str(h2.contents[0])
The h2.contents[0] will return a NavigableString object here. Its behavior for print is same as that as the string version of it. If you're only going to use the print statement with it, then the str() call won't be necessary.
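For instance, a minimal standalone sketch showing the types involved (the markup from the question is inlined for the example):
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup('<div id="title"><h2>\nMy title <span class="subtitle">My Subtitle</span></h2></div>', 'html.parser')
node = soup.find('div', id="title").h2.contents[0]
print(type(node))                         # <class 'bs4.element.NavigableString'>
print(isinstance(node, NavigableString))  # True
print(str(node).strip())                  # My title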
Check out this example to understand:
from bs4 import BeautifulSoup
#html source
html_source = '''
<div class="test">
<h2>paragraph1</h2>
</div>
'''
soup = BeautifulSoup(html_source, 'html.parser')
#find h2 tag
print(soup.h2.string)
output
paragraph1
Another solution.
from simplified_scrapy import SimplifiedDoc
html = '''
<div id="title">
<h2>
My title <span class="subtitle">My Subtitle</span></h2></div>
'''
doc = SimplifiedDoc(html)
h2 = doc.select('div#title').h2
print('title:', h2.firstText())
print('subtitle:', h2.span.text)
Result:
title: My title
subtitle: My Subtitle
Is it possible to set markup as tag content (akin to setting innerHtml in JavaScript)?
For the sake of example, let's say I want to add 10 <a> elements to a <div>, but have them separated with a comma:
soup = BeautifulSoup(<<some document here>>)
a_tags = ["<a>1</a>", "<a>2</a>", ...] # list of strings
div = soup.new_tag("div")
a_str = ",".join(a_tags)
Using div.append(a_str) escapes < and > into &lt; and &gt;, so I end up with
<div>&lt;a&gt;1&lt;/a&gt;,&lt;a&gt;2&lt;/a&gt;, ... </div>
BeautifulSoup(a_str) wraps this in <html>, and I see getting the tree out of it as an inelegant hack.
What to do?
You need to create a BeautifulSoup object out of your HTML string containing links:
from bs4 import BeautifulSoup
soup = BeautifulSoup()
div = soup.new_tag('div')
a_tags = ["<a>1</a>", "<a>2</a>", "<a>3</a>", "<a>4</a>", "<a>5</a>"]
a_str = ",".join(a_tags)
div.append(BeautifulSoup(a_str, 'html.parser'))
soup.append(div)
print soup
Prints:
<div><a>1</a>,<a>2</a>,<a>3</a>,<a>4</a>,<a>5</a></div>
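This works because html.parser parses the string as a plain fragment, without wrapping it in <html> and <body> tags the way the lxml parser would, so the parsed links can be appended to the div directly.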
Alternative solution:
For each link create a Tag and append it to div. Also, append a comma after each link except the last:
from bs4 import BeautifulSoup

soup = BeautifulSoup()
div = soup.new_tag('div')

for x in xrange(1, 6):
    link = soup.new_tag('a')
    link.string = str(x)
    div.append(link)

    # do not append a comma after the last element (xrange stops at 5)
    if x != 5:
        div.append(",")

soup.append(div)
print soup
Prints:
<div><a>1</a>,<a>2</a>,<a>3</a>,<a>4</a>,<a>5</a></div>