Unable to get the desired portion kicking out the rest - python

I've written a script in python to grab address from a webpage. When I execute my script I get the address like Zimmerbachstr. 51, 74676 Niedernhall-Waldzimmern Germany. They all are within this selector "[itemprop='address']". However, my question is how can i get the address except for the country name which is within "[itemprop='addressCountry']".
The total address are within this block of html:
<div class="push-half--bottom " itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Zimmerbachstr. 51</span>
<span itemprop="postalCode">74676</span>
<span itemprop="addressLocality">Niedernhall-Waldzimmern</span><br>
<span itemprop="addressCountry">Germany</span><br>
</div>
If I try like below I can get the desired portion of address but this is not an ideal way at all:
from bs4 import BeautifulSoup
content = """
<div class="push-half--bottom " itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Zimmerbachstr. 51</span>
<span itemprop="postalCode">74676</span>
<span itemprop="addressLocality">Niedernhall-Waldzimmern</span><br>
<span itemprop="addressCountry">Germany</span><br>
</div>
"""
soup = BeautifulSoup(content,"lxml")
[country.extract() for country in soup.select("[itemprop='addressCountry']")]
item = [item.get_text(strip=True) for item in soup.select("[itemprop='address']")]
print(item)
This is the expected output Zimmerbachstr. 51, 74676 Niedernhall-Waldzimmern.
To be clearer: I would like to have any oneliner solution without any hardcoded index applied (because the country name may not always appear in the last position).

Solution using lxml.html:
from lxml import html
content = """
<div class="push-half--bottom " itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Zimmerbachstr. 51</span>
<span itemprop="postalCode">74676</span>
<span itemprop="addressLocality">Niedernhall-Waldzimmern</span><br>
<span itemprop="addressCountry">Germany</span><br>
</div>
"""
source = html.fromstring(content)
address = ", ".join([span.text for span in source.xpath("//div[#itemprop='address']/span[#itemprop='streetAddress' or #itemprop='postalCode' or #itemprop='addressLocality']")])
or
address = ", ".join([span.text for span in source.xpath("//div[#itemprop='address']/span[not(#itemprop='addressCountry')]")])
Output:
'Zimmerbachstr. 51, 74676, Niedernhall-Waldzimmern'

Related

Scrape values inside span class webpage with beautifulsoup python

Hello everyone I have a webpage I'm trying to scrape and the page has tons of span classes and most of which is useless information I posted a section of the span class data that I need but I'm not able to do find.all span because there are 100's of others not needed.
<div class="col-md-4">
<p>
<span class="text-muted">File Number</span><br>
A-21-897274
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Location</span><br>
Ohio
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Date</span><br>
07/01/2022
</p>
</div>
</div>
I need the span titles:
File Number, Location, Date
and then the values that match:
"A-21-897274", "Ohio", "07/01/2022"
I need this printed out so I can make a pandas data frame. But I cant seem to get the specific spans printed with their value.
What I've tried:
import bs4
from bs4 import BeautifulSoup
soup = BeautifulSoup(..., 'lxml')
for title_tag in soup.find_all('span', class_='text-muted'):
# get the last sibling
*_, value_tag = title_tag.next_siblings
title = title_tag.text.strip()
if isinstance(value_tag, bs4.element.Tag):
value = value_tag.text.strip()
else: # it's a navigable string element
value = value_tag.strip()
print(title, value)
output:
File Number "A-21-897274"
Location "Ohio"
Operations_Manager "Joanna"
Date "07/01/2022"
Type "Transfer"
Status "Open"
ETC "ETC"
ETC "ETC"
This will print out everything I need BUT it also prints out 100's of other values I don't want/need.
You can use function in soup.find_all to select only wanted elements and then .find_next_sibling() to select the value. For example:
from bs4 import BeautifulSoup
html_doc = """
<div class="col-md-4">
<p>
<span class="text-muted">File Number</span><br>
A-21-897274
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Location</span><br>
Ohio
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Date</span><br>
07/01/2022
</p>
</div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
def correct_tag(tag):
return tag.name == "span" and tag.get_text(strip=True) in {
"File Number",
"Location",
"Date",
}
for t in soup.find_all(correct_tag):
print(f"{t.text}: {t.find_next_sibling(text=True).strip()}")
Prints:
File Number: A-21-897274
Location: Ohio
Date: 07/01/2022

How do I scrape data from a tag belonging to the same label and the same class? BeautifulSoup

I have a tag with the same tag and the same name(property).
Here is my code
first_movie.find('p',{'class' : 'sort-num_votes-visible'})
Here is my output
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span data-value="978272" name="nv">978,272</span>
<span class="ghost">|</span> <span class="text-muted">Gross:</span>
<span data-value="858,373,000" name="nv">$858.37M</span>
</p>
I'm reaching span tag this code;
first_movie.find('span', {'name':'nv',"data-value": True})
978272 --> output
But i want reach the other value with named nv ($858.37M).
My code is only getting this value (978,272) because tags names is equal each other (name = nv)
You're close.
Try using find_all and then grab the last element.
For example:
from bs4 import BeautifulSoup
html_sample = '''
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span data-value="978272" name="nv">978,272</span>
<span class="ghost">|</span> <span class="text-muted">Gross:</span>
<span data-value="858,373,000" name="nv">$858.37M</span>
</p>
'''
soup = (
BeautifulSoup(html_sample, "lxml")
.find_all("span", {'name':'nv',"data-value": True})
)
print(soup[-1].getText())
Output:
$858.37M
If you reach for all spans in p tag, you can work with them like with list and use index to reach for last div.
movies = soup.find('p',{'class' : 'sort-num_votes-visible'})
my_movie = movies.findAll('span')
my_span = my_movie[3].text

Extract html block based on tag, class and string content

I am really new to bf4 and I would like to get specific content from an html page.
When I try the following code, I will get many results having the same tag and class. So I need to filter more. There is a string content into the block I am interested in. Is there a way to additionally scrape also by content? Any contribution is appreciated.
html_doc = requests.get('https://www.blockchain.com/bch/address/qqe2tae7hfga2zj5jj8mtjsgznjpy5rvyglew4cy8m')
soup = BeautifulSoup(html_doc.content, 'html.parser')
print(soup.find_all('span', class_='sc-1ryi78w-0 gCzMgE sc-16b9dsl-1 kUAhZx u3ufsr-0 fGQJzg'))
Edit:
I should add that the content look like the following. So the there is a string for which I want to extract the value but the value is in the next tag. Here I want to extract 3.79019103 which is under the string 'Final Balance'.
Total Sent
</span>
</div>
</div>
<div class="sc-8sty72-0 kcFwUU">
<span class="sc-1ryi78w-0 gCzMgE sc-16b9dsl-1 kUAhZx u3ufsr-0 fGQJzg" opacity="1">
13794.11698089 BCH
</span>
</div>
</div>
<div class="sc-1enh6xt-0 jqiNji">
<div class="sc-8sty72-0 kcFwUU">
<div>
<span class="sc-1ryi78w-0 gCzMgE sc-16b9dsl-1 kUAhZx sc-1n72lkw-0 lhmHll" opacity="1">
Final Balance
</span>
</div>
</div>
<div class="sc-8sty72-0 kcFwUU">
<span class="sc-1ryi78w-0 gCzMgE sc-16b9dsl-1 kUAhZx u3ufsr-0 fGQJzg" opacity="1">
3.79019103 BCH
</span>
</div>
</div>
</div>
</div>
</div>
For finding the Final Balance tag:
final_balance_tag = next(x for x in soup.find_all('span') if 'Final Balance' in x.text)
With this tag you may just jump to the next span tag.
final_balance_tag.findNext('span')
Which gives
<span class="sc-1ryi78w-0 gCzMgE sc-16b9dsl-1 kUAhZx u3ufsr-0 fGQJzg" opacity="1">
3.79019103 BCH
</span>
Search for the string Final Balance using the text= <your string> parameter.
Get the next tag using find_next(), which returns the first match.
Use a list comprehension to filter the output only if it isdigit().
import requests
from bs4 import BeautifulSoup
URL = 'https://www.blockchain.com/bch/address/qqe2tae7hfga2zj5jj8mtjsgznjpy5rvyglew4cy8m'
soup = BeautifulSoup(requests.get(URL).content, 'html.parser')
price = soup.find(text=lambda t: "Final" in t).find_next(text=True)
print("".join([t for t in price if t.isdigit()]))
Output (currently):
000000000

Python BeautifulSoup get data from span tag

Please have a look at following html code:
<section class = "products">
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="5999"> 5,999</span> </span>
<span class="price -old ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="9999"> 9,999</span> </span>
</span>
</section>
In the products section, there are 40 such code blocks which contain prices for items. Not all products have old prices but all products have current price. But when I try to access item prices it also gives me old prices, so I get total 69 item prices which should be 40. I am missing something but since I am new to this field I couldn't figure it out. Please someone could help. Thanks.
You can use a CSS selector to match the exact class name. For example, here, you can use span[class="price "] as the selector, and it won't match the old prices.
html = '''
<section class = "products">
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="5999"> 5,999</span>
</span>
<span class="price -old ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="9999"> 9,999</span>
</span>
</span>
</section>'''
soup = BeautifulSoup(html, 'lxml')
for price in soup.select('span[class="price "]'):
print(price.get_text(' ', strip=True))
Output:
Rs. 5,999
Or, you could also use a custom function to match the class name.
for price in soup.find_all('span', class_=lambda c: c == 'price '):
print(price.get_text(' ', strip=True))

Accessing untagged text using beautifulsoup

I am using python and beautifulsoup4 to extract some address information.
More specifically, I require assistance when retrieving non-US based zip codes.
Consider the following html data of a US based company: (already a soup object)
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">999 State St Ste 100</span><br/>
<span class="locality">Salt Lake City,</span>
<span class="region">UT</span>
<span class="zip">84114-0002,</span>
<br/><span class="country-name">United States</span>
</p>
<p>
<span class="tel">
<strong class="type">Phone: </strong>+1-000-000-000
</span><br/>
</p>
<p class="companyURL"><a class="url ext" href="http://www.website.com" target="_blank">http://www.website.com</a></p>
</div>
</ul>
</div>
I can extract the zipcode (84114-0002) by using the following piece of python code:
class CompanyDescription:
def __init__(self, page):
self.data = page.find('div', attrs={'id': 'companyDescription'})
def address(self):
#TODO: Also retrieve the Zipcode for UK and German based addresses - tricky!
address = {'street-address': '', 'locality': '', 'region': '', 'zip': '', 'country-name': ''}
for key in address:
try:
adr = self.data.find('p', attrs={'id': 'adr'})
if adr.find('span', attrs={'class': key}) is None:
address[key] = ''
else:
address[key] = adr.find('span', attrs={'class': key}).text.split(',')[0]
# Attempting to grab another zip code value
if address['zip'] == '':
pass
except:
# We should return a dictionary with "" as key adr
return address
return address
You can see that I need some counsel with line if address['zip'] == '':
These two soup object examples are giving me trouble. In the below I would like to retrieve EC4N 4SA
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
EC4N 4SA
<span class="region">London</span>
<br/><span class="country-name">England</span>
</p>
<p>
</p>
<p class="companyURL"><a class="url ext" href="http://www.website.com.com" target="_blank">http://www.website.com.com</a></p>
</div>
<p><strong>Line of Business</strong> <br/>Management services, nsk</p>
</div>
as well as below, where I am interested in getting 71364
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">Alfred-Kärcher-Str. 100</span><br/>
71364
<span class="locality">Winnenden</span>
<span class="region">Baden-Württemberg</span>
<br/><span class="country-name">Germany</span>
</p>
<p>
<span class="tel">
<strong class="type">Phone: </strong>+00-1234567
</span><br/>
<span class="tel"><strong class="type">Fax: </strong>+00-1234567</span>
</p>
</div>
</div>
Now, I am running this program over approximately 68,000 accounts of which 28,000 are non-US based. I have only pulled out two examples of which I know the current method is not bullet proof. There may be other address formats where this script is not working as expected but I believe figuring out UK and German based accounts will help tremendously.
Thanks in advance
Because it is only text without tag inside <p> so you can use
find_all(text=True, recursive=False)
to get only text (without tags) but not from nested tags (<span>). This gives list with your text and some \n and spaces so you can use join() to create one string, and strip() to remove all \n and spaces.
data = '''<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
EC4N 4SA
<span class="region">London</span>
<br/><span class="country-name">England</span>
</p>'''
from bs4 import BeautifulSoup as BS
soup = BS(data, 'html.parser').find('p')
print(''.join(soup.find_all(text=True, recursive=False)).strip())
result: EC4N 4SA
The same with second HTML
data = '''<p id="adr">
<span class="street-address">Alfred-Kärcher-Str. 100</span><br/>
71364
<span class="locality">Winnenden</span>
<span class="region">Baden-Württemberg</span>
<br/><span class="country-name">Germany</span>
</p>'''
from bs4 import BeautifulSoup as BS
soup = BS(data, 'html.parser').find('p')
print(''.join(soup.find_all(text=True, recursive=False)).strip())
result: 71364

Categories

Resources