I am using Python and beautifulsoup4 to extract some address information.
More specifically, I need help retrieving zip codes for non-US addresses.
Consider the following HTML for a US-based company (already a soup object):
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">999 State St Ste 100</span><br/>
<span class="locality">Salt Lake City,</span>
<span class="region">UT</span>
<span class="zip">84114-0002,</span>
<br/><span class="country-name">United States</span>
</p>
<p>
<span class="tel">
<strong class="type">Phone: </strong>+1-000-000-000
</span><br/>
</p>
<p class="companyURL"><a class="url ext" href="http://www.website.com" target="_blank">http://www.website.com</a></p>
</div>
</ul>
</div>
I can extract the zipcode (84114-0002) by using the following piece of Python code:
class CompanyDescription:
    def __init__(self, page):
        self.data = page.find('div', attrs={'id': 'companyDescription'})

    def address(self):
        # TODO: Also retrieve the Zipcode for UK and German based addresses - tricky!
        address = {'street-address': '', 'locality': '', 'region': '', 'zip': '', 'country-name': ''}
        for key in address:
            try:
                adr = self.data.find('p', attrs={'id': 'adr'})
                if adr.find('span', attrs={'class': key}) is None:
                    address[key] = ''
                else:
                    address[key] = adr.find('span', attrs={'class': key}).text.split(',')[0]
                # Attempting to grab another zip code value
                if address['zip'] == '':
                    pass
            except:
                # We should return a dictionary with "" as each value
                return address
        return address
You can see that I need some advice on the line if address['zip'] == '':
These two soup objects are giving me trouble. In the one below, I would like to retrieve EC4N 4SA:
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
EC4N 4SA
<span class="region">London</span>
<br/><span class="country-name">England</span>
</p>
<p>
</p>
<p class="companyURL"><a class="url ext" href="http://www.website.com.com" target="_blank">http://www.website.com.com</a></p>
</div>
<p><strong>Line of Business</strong> <br/>Management services, nsk</p>
</div>
as well as below, where I am interested in getting 71364
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">Alfred-Kärcher-Str. 100</span><br/>
71364
<span class="locality">Winnenden</span>
<span class="region">Baden-Württemberg</span>
<br/><span class="country-name">Germany</span>
</p>
<p>
<span class="tel">
<strong class="type">Phone: </strong>+00-1234567
</span><br/>
<span class="tel"><strong class="type">Fax: </strong>+00-1234567</span>
</p>
</div>
</div>
Now, I am running this program over approximately 68,000 accounts, of which 28,000 are non-US based. I have only pulled out two examples where I know the current method is not bulletproof. There may be other address formats where this script does not work as expected, but I believe figuring out the UK- and German-based accounts will help tremendously.
Thanks in advance
Because the postcode is only bare text (with no tag around it) directly inside <p>, you can use
find_all(text=True, recursive=False)
to get only the text nodes that sit directly in the tag, but not the text inside nested tags (<span>). This gives a list with your text plus some \n characters and spaces, so you can use join() to build one string and strip() to remove the surrounding \n and spaces.
data = '''<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
EC4N 4SA
<span class="region">London</span>
<br/><span class="country-name">England</span>
</p>'''
from bs4 import BeautifulSoup as BS
soup = BS(data, 'html.parser').find('p')
print(''.join(soup.find_all(text=True, recursive=False)).strip())
result: EC4N 4SA
The same approach works with the second HTML:
data = '''<p id="adr">
<span class="street-address">Alfred-Kärcher-Str. 100</span><br/>
71364
<span class="locality">Winnenden</span>
<span class="region">Baden-Württemberg</span>
<br/><span class="country-name">Germany</span>
</p>'''
from bs4 import BeautifulSoup as BS
soup = BS(data, 'html.parser').find('p')
print(''.join(soup.find_all(text=True, recursive=False)).strip())
result: 71364
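To fold this into the question's address() method, here is a sketch (the helper name extract_zip and the fallback logic are my own addition, not part of the original code):

```python
from bs4 import BeautifulSoup

# UK-style sample from the question: the postcode is loose text inside <p id="adr">
html = '''<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
EC4N 4SA
<span class="region">London</span>
<br/><span class="country-name">England</span>
</p>'''

def extract_zip(adr):
    """Prefer <span class="zip">; otherwise fall back to the bare text in the tag."""
    span = adr.find('span', attrs={'class': 'zip'})
    if span is not None:
        return span.text.split(',')[0]
    # Non-US pages leave the postcode as loose text directly inside <p id="adr">
    return ''.join(adr.find_all(text=True, recursive=False)).strip()

adr = BeautifulSoup(html, 'html.parser').find('p', attrs={'id': 'adr'})
print(extract_zip(adr))  # EC4N 4SA
```

The same fallback covers the German sample, where 71364 is likewise loose text; an address with neither a zip span nor loose text simply yields ''.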
Hello everyone. I have a webpage I'm trying to scrape, and the page has tons of span classes, most of which contain useless information. I posted the section of the span data that I need, but I'm not able to just find_all on span because there are hundreds of others I don't need.
<div class="col-md-4">
<p>
<span class="text-muted">File Number</span><br>
A-21-897274
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Location</span><br>
Ohio
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Date</span><br>
07/01/2022
</p>
</div>
</div>
I need the span titles:
File Number, Location, Date
and then the values that match:
"A-21-897274", "Ohio", "07/01/2022"
I need this printed out so I can make a pandas DataFrame, but I can't seem to get the specific spans printed with their values.
What I've tried:
import bs4
from bs4 import BeautifulSoup

soup = BeautifulSoup(..., 'lxml')

for title_tag in soup.find_all('span', class_='text-muted'):
    # get the last sibling
    *_, value_tag = title_tag.next_siblings
    title = title_tag.text.strip()
    if isinstance(value_tag, bs4.element.Tag):
        value = value_tag.text.strip()
    else:  # it's a navigable string element
        value = value_tag.strip()
    print(title, value)
output:
File Number "A-21-897274"
Location "Ohio"
Operations_Manager "Joanna"
Date "07/01/2022"
Type "Transfer"
Status "Open"
ETC "ETC"
ETC "ETC"
This will print out everything I need, BUT it also prints out hundreds of other values I don't want or need.
You can pass a function to soup.find_all to select only the wanted elements, and then use .find_next_sibling() to select the value. For example:
from bs4 import BeautifulSoup
html_doc = """
<div class="col-md-4">
<p>
<span class="text-muted">File Number</span><br>
A-21-897274
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Location</span><br>
Ohio
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Date</span><br>
07/01/2022
</p>
</div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
def correct_tag(tag):
    return tag.name == "span" and tag.get_text(strip=True) in {
        "File Number",
        "Location",
        "Date",
    }

for t in soup.find_all(correct_tag):
    print(f"{t.text}: {t.find_next_sibling(text=True).strip()}")
Prints:
File Number: A-21-897274
Location: Ohio
Date: 07/01/2022
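Since the stated end goal is a pandas DataFrame, the filtered title/value pairs can be collected into a dict rather than printed. A sketch along those lines (the wanted set and variable names are mine; the sample HTML is condensed from the question):

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="col-md-4"><p><span class="text-muted">File Number</span><br>A-21-897274</p></div>
<div class="col-md-4"><p><span class="text-muted">Location</span><br>Ohio</p></div>
<div class="col-md-4"><p><span class="text-muted">Date</span><br>07/01/2022</p></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

wanted = {"File Number", "Location", "Date"}
# one dict per scraped page: {title: value}, keeping only the wanted titles
record = {
    t.get_text(strip=True): t.find_next_sibling(text=True).strip()
    for t in soup.find_all("span", class_="text-muted")
    if t.get_text(strip=True) in wanted
}
print(record)
```

A one-row frame is then pd.DataFrame([record]); records from many pages can be appended to a list and passed to pd.DataFrame in one go.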
I have two span tags with the same name attribute (name="nv").
Here is my code
first_movie.find('p',{'class' : 'sort-num_votes-visible'})
Here is my output
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span data-value="978272" name="nv">978,272</span>
<span class="ghost">|</span> <span class="text-muted">Gross:</span>
<span data-value="858,373,000" name="nv">$858.37M</span>
</p>
I'm reaching the span tag with this code:
first_movie.find('span', {'name': 'nv', "data-value": True})
978272 --> output
But I want to reach the other value named nv ($858.37M). My code only gets the first value (978,272) because the tags' name attributes are equal to each other (name="nv").
You're close.
Try using find_all and then grab the last element.
For example:
from bs4 import BeautifulSoup
html_sample = '''
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span data-value="978272" name="nv">978,272</span>
<span class="ghost">|</span> <span class="text-muted">Gross:</span>
<span data-value="858,373,000" name="nv">$858.37M</span>
</p>
'''
soup = (
    BeautifulSoup(html_sample, "lxml")
    .find_all("span", {'name': 'nv', "data-value": True})
)
print(soup[-1].getText())
Output:
$858.37M
If you grab all the spans in the p tag, you can work with them like a list and use an index to reach the last one:
movies = soup.find('p', {'class': 'sort-num_votes-visible'})
my_spans = movies.find_all('span')
my_span = my_spans[-1].text
(Indexing with -1 takes the last span, so this keeps working no matter how many spans precede the one you want.)
I am really new to bs4 and I would like to get specific content from an HTML page.
When I try the following code, I get many results having the same tag and class, so I need to filter more. There is a known string inside the block I am interested in. Is there a way to additionally filter by content? Any contribution is appreciated.
html_doc = requests.get('https://www.blockchain.com/bch/address/qqe2tae7hfga2zj5jj8mtjsgznjpy5rvyglew4cy8m')
soup = BeautifulSoup(html_doc.content, 'html.parser')
print(soup.find_all('span', class_='sc-1ryi78w-0 gCzMgE sc-16b9dsl-1 kUAhZx u3ufsr-0 fGQJzg'))
Edit:
I should add that the content looks like the following. There is a string whose associated value I want to extract, but the value is in the next tag. Here I want to extract 3.79019103, which sits under the string 'Final Balance'.
Total Sent
</span>
</div>
</div>
<div class="sc-8sty72-0 kcFwUU">
<span class="sc-1ryi78w-0 gCzMgE sc-16b9dsl-1 kUAhZx u3ufsr-0 fGQJzg" opacity="1">
13794.11698089 BCH
</span>
</div>
</div>
<div class="sc-1enh6xt-0 jqiNji">
<div class="sc-8sty72-0 kcFwUU">
<div>
<span class="sc-1ryi78w-0 gCzMgE sc-16b9dsl-1 kUAhZx sc-1n72lkw-0 lhmHll" opacity="1">
Final Balance
</span>
</div>
</div>
<div class="sc-8sty72-0 kcFwUU">
<span class="sc-1ryi78w-0 gCzMgE sc-16b9dsl-1 kUAhZx u3ufsr-0 fGQJzg" opacity="1">
3.79019103 BCH
</span>
</div>
</div>
</div>
</div>
</div>
For finding the Final Balance tag:
final_balance_tag = next(x for x in soup.find_all('span') if 'Final Balance' in x.text)
With this tag you may just jump to the next span tag.
final_balance_tag.find_next('span')
Which gives
<span class="sc-1ryi78w-0 gCzMgE sc-16b9dsl-1 kUAhZx u3ufsr-0 fGQJzg" opacity="1">
3.79019103 BCH
</span>
Search for the string Final Balance using the text=<your string> parameter.
Get the next text node using find_next(), which returns the first match.
Use a list comprehension to keep only the characters for which isdigit() is true.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.blockchain.com/bch/address/qqe2tae7hfga2zj5jj8mtjsgznjpy5rvyglew4cy8m'
soup = BeautifulSoup(requests.get(URL).content, 'html.parser')
price = soup.find(text=lambda t: "Final" in t).find_next(text=True)
print("".join([t for t in price if t.isdigit()]))
Output (currently):
000000000
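One caveat: filtering with isdigit() drops the decimal point, which is why the balance above collapses into a run of digits. A variant that keeps the full number by splitting the text instead (a sketch against a static stand-in for the relevant markup, since the live page's class names change):

```python
from bs4 import BeautifulSoup

# Static stand-in for the label/value part of the page
snippet = """
<div><span opacity="1">Final Balance</span></div>
<div><span opacity="1">3.79019103 BCH</span></div>
"""
soup = BeautifulSoup(snippet, "html.parser")

label = soup.find(text=lambda t: t and "Final Balance" in t)
value = label.find_next(text=lambda t: t and t.strip())  # next non-empty text node
amount = value.strip().split()[0]  # keeps the decimal point, drops the unit
print(amount)  # 3.79019103
```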
I am trying to scrape a spotify playlist webpage to pull out artist and song name data. Here is my python code:
#! /usr/bin/python
from lxml import html
import requests
playlistPage = requests.get('https://open.spotify.com/playlist/0csaTlUWTfiyXscv4qKDGE')
print("\n\nprinting variable playListPage: " + str(playlistPage))
tree = html.fromstring(playlistPage.content)
print("printing variable tree: " + str(tree))
artistList = tree.xpath("//span/a[@class='tracklist-row__artist-name-link']/text()")
print("printing variable artistList: " + str(artistList) + "\n\n")
Right now the final print statement is printing out an empty list.
Here is some example HTML from the page I'm trying to scrape. Ideally my code would pull out the string "M83"...not sure how much html is relevant so pasting what I believe necessary:
<div class="react-contextmenu-wrapper">
<div draggable="true">
<li class="tracklist-row" role="button" tabindex="0" data-testid="tracklist-row">
<div class="tracklist-col position-outer">
<div class="tracklist-play-pause tracklist-top-align">
<svg class="icon-play" viewBox="0 0 85 100">
<path fill="currentColor" d="M81 44.6c5 3 5 7.8 0 10.8L9 98.7c-5 3-9 .7-9-5V6.3c0-5.7 4-8 9-5l72 43.3z">
<title>
PLAY</title>
</path>
</svg>
</div>
<div class="position tracklist-top-align">
<span class="spoticon-track-16">
</span>
</div>
</div>
<div class="tracklist-col name">
<div class="track-name-wrapper tracklist-top-align">
<div class="tracklist-name ellipsis-one-line" dir="auto">
Intro</div>
<div class="second-line">
<span class="TrackListRow__artists ellipsis-one-line" dir="auto">
<span class="react-contextmenu-wrapper">
<span draggable="true">
<a tabindex="-1" class="tracklist-row__artist-name-link" href="/artist/63MQldklfxkjYDoUE4Tppz">
M83</a>
</span>
</span>
</span>
<span class="second-line-separator" aria-label="in album">
•</span>
<span class="TrackListRow__album ellipsis-one-line" dir="auto">
<span class="react-contextmenu-wrapper">
<span draggable="true">
<a tabindex="-1" class="tracklist-row__album-name-link" href="/album/6R0ynY7RF20ofs9GJR5TXR">
Hurry Up, We're Dreaming</a>
</span>
</span>
</span>
</div>
</div>
</div>
<div class="tracklist-col more">
<div class="tracklist-top-align">
<div class="react-contextmenu-wrapper">
<button class="_2221af4e93029bedeab751d04fab4b8b-scss c74a35c3aba27d72ee478f390f5d8c16-scss" type="button">
<div class="spoticon-ellipsis-16">
</div>
</button>
</div>
</div>
</div>
<div class="tracklist-col tracklist-col-duration">
<div class="tracklist-duration tracklist-top-align">
<span>
5:22</span>
</div>
</div>
</li>
</div>
</div>
A solution using Beautiful Soup:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://open.spotify.com/playlist/0csaTlUWTfiyXscv4qKDGE')
soup = bs(page.content, 'lxml')
tracklist_container = soup.find("div", {"class": "tracklist-container"})
track_artists_container = tracklist_container.findAll("span", {"class": "artists-albums"})
artists = []
for ta in track_artists_container:
    artists.append(ta.find("span").text)
print(artists[0])
prints
M83
This solution gets all the artists on the page so you could print out the list artists and get:
['M83',
'Charles Bradley',
'Bon Iver',
...
'Death Cab for Cutie',
'Destroyer']
And you can extend this to track names and albums quite easily by changing the classname in the findAll(...) function call.
Nice answer provided by @eNc. An lxml solution:
from lxml import html
import requests
playlistPage = requests.get('https://open.spotify.com/playlist/0csaTlUWTfiyXscv4qKDGE')
tree = html.fromstring(playlistPage.content)
artistList = tree.xpath("//span[@class='artists-albums']/a[1]/span/text()")
print(artistList)
Output :
['M83', 'Charles Bradley', 'Bon Iver', 'The Middle East', 'The Antlers', 'Handsome Furs', 'Frank Turner', 'Frank Turner', 'Amy Winehouse', 'Black Lips', 'M83', 'Florence + The Machine', 'Childish Gambino', 'DJ Khaled', 'Kendrick Lamar', 'Future Islands', 'Future Islands', 'JAY-Z', 'Blood Orange', 'Cut Copy', 'Rihanna', 'Tedeschi Trucks Band', 'Bill Callahan', 'St. Vincent', 'Adele', 'Beirut', 'Childish Gambino', 'David Guetta', 'Death Cab for Cutie', 'Destroyer']
Since you can't get all the results in one shot, maybe you should switch to Selenium.
I've written a script in Python to grab an address from a webpage. When I execute my script I get an address like Zimmerbachstr. 51, 74676 Niedernhall-Waldzimmern Germany. All the fields are within the selector "[itemprop='address']". However, my question is: how can I get the address except for the country name, which is within "[itemprop='addressCountry']"?
The total address are within this block of html:
<div class="push-half--bottom " itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Zimmerbachstr. 51</span>
<span itemprop="postalCode">74676</span>
<span itemprop="addressLocality">Niedernhall-Waldzimmern</span><br>
<span itemprop="addressCountry">Germany</span><br>
</div>
If I try it like below, I can get the desired portion of the address, but this is not an ideal way at all:
from bs4 import BeautifulSoup
content = """
<div class="push-half--bottom " itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Zimmerbachstr. 51</span>
<span itemprop="postalCode">74676</span>
<span itemprop="addressLocality">Niedernhall-Waldzimmern</span><br>
<span itemprop="addressCountry">Germany</span><br>
</div>
"""
soup = BeautifulSoup(content,"lxml")
[country.extract() for country in soup.select("[itemprop='addressCountry']")]
item = [item.get_text(strip=True) for item in soup.select("[itemprop='address']")]
print(item)
This is the expected output: Zimmerbachstr. 51, 74676 Niedernhall-Waldzimmern.
To be clearer: I would like a one-liner solution without any hardcoded index (because the country name may not always appear in the last position).
Solution using lxml.html:
from lxml import html
content = """
<div class="push-half--bottom " itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Zimmerbachstr. 51</span>
<span itemprop="postalCode">74676</span>
<span itemprop="addressLocality">Niedernhall-Waldzimmern</span><br>
<span itemprop="addressCountry">Germany</span><br>
</div>
"""
source = html.fromstring(content)
address = ", ".join([span.text for span in source.xpath("//div[@itemprop='address']/span[@itemprop='streetAddress' or @itemprop='postalCode' or @itemprop='addressLocality']")])
or
address = ", ".join([span.text for span in source.xpath("//div[@itemprop='address']/span[not(@itemprop='addressCountry')]")])
Output:
'Zimmerbachstr. 51, 74676, Niedernhall-Waldzimmern'
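For completeness, the same exclusion can be done in BeautifulSoup itself without mutating the tree via extract(). A sketch using the question's own markup (the child-selector approach is mine, not from the original answer):

```python
from bs4 import BeautifulSoup

content = """
<div class="push-half--bottom " itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Zimmerbachstr. 51</span>
<span itemprop="postalCode">74676</span>
<span itemprop="addressLocality">Niedernhall-Waldzimmern</span><br>
<span itemprop="addressCountry">Germany</span><br>
</div>
"""
soup = BeautifulSoup(content, "html.parser")

address = ", ".join(
    span.get_text(strip=True)
    for span in soup.select("div[itemprop='address'] > span")
    if span.get("itemprop") != "addressCountry"  # skip the country span wherever it appears
)
print(address)  # Zimmerbachstr. 51, 74676, Niedernhall-Waldzimmern
```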