Unable to get the desired portion kicking out the rest

I've written a Python script to grab an address from a webpage. When I run it, I get the full address, e.g. Zimmerbachstr. 51, 74676 Niedernhall-Waldzimmern Germany. All of it sits within the selector "[itemprop='address']". My question is: how can I get the address without the country name, which is within "[itemprop='addressCountry']"?
The full address is within this block of HTML:
<div class="push-half--bottom " itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Zimmerbachstr. 51</span>
<span itemprop="postalCode">74676</span>
<span itemprop="addressLocality">Niedernhall-Waldzimmern</span><br>
<span itemprop="addressCountry">Germany</span><br>
</div>
If I try the following, I get the desired portion of the address, but this is far from ideal:
from bs4 import BeautifulSoup
content = """
<div class="push-half--bottom " itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Zimmerbachstr. 51</span>
<span itemprop="postalCode">74676</span>
<span itemprop="addressLocality">Niedernhall-Waldzimmern</span><br>
<span itemprop="addressCountry">Germany</span><br>
</div>
"""
soup = BeautifulSoup(content,"lxml")
[country.extract() for country in soup.select("[itemprop='addressCountry']")]
item = [item.get_text(strip=True) for item in soup.select("[itemprop='address']")]
print(item)
This is the expected output: Zimmerbachstr. 51, 74676 Niedernhall-Waldzimmern.
To be clearer: I would like a one-liner solution without any hardcoded index (because the country name may not always appear in the last position).

Solution using lxml.html:
from lxml import html
content = """
<div class="push-half--bottom " itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Zimmerbachstr. 51</span>
<span itemprop="postalCode">74676</span>
<span itemprop="addressLocality">Niedernhall-Waldzimmern</span><br>
<span itemprop="addressCountry">Germany</span><br>
</div>
"""
source = html.fromstring(content)
address = ", ".join([span.text for span in source.xpath("//div[@itemprop='address']/span[@itemprop='streetAddress' or @itemprop='postalCode' or @itemprop='addressLocality']")])
or
address = ", ".join([span.text for span in source.xpath("//div[@itemprop='address']/span[not(@itemprop='addressCountry')]")])
Output:
'Zimmerbachstr. 51, 74676, Niedernhall-Waldzimmern'
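The same idea can probably be expressed with BeautifulSoup alone, using a CSS :not() selector to skip the country span without any index. This is a minimal sketch, assuming a bs4 release that ships with the soupsieve selector engine (4.7 or later); it uses the built-in html.parser so lxml is not required.

```python
from bs4 import BeautifulSoup

content = """
<div class="push-half--bottom " itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Zimmerbachstr. 51</span>
<span itemprop="postalCode">74676</span>
<span itemprop="addressLocality">Niedernhall-Waldzimmern</span><br>
<span itemprop="addressCountry">Germany</span><br>
</div>
"""

soup = BeautifulSoup(content, "html.parser")

# Select every direct child span of the address block except the country one.
selector = "[itemprop='address'] > span:not([itemprop='addressCountry'])"
address = ", ".join(span.get_text(strip=True) for span in soup.select(selector))
print(address)
```

Because the country is excluded by attribute rather than by position, this should keep working if the country span appears first or in the middle.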

Related

Scrape values inside span class webpage with beautifulsoup python

Hello everyone. I have a webpage I'm trying to scrape. The page has tons of span elements, most of which hold useless information. Below is the section of span data that I need; I can't simply use find_all on span because there are hundreds of others I don't need.
<div class="col-md-4">
<p>
<span class="text-muted">File Number</span><br>
A-21-897274
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Location</span><br>
Ohio
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Date</span><br>
07/01/2022
</p>
</div>
</div>
I need the span titles:
File Number, Location, Date
and then the values that match:
"A-21-897274", "Ohio", "07/01/2022"
I need this printed out so I can make a pandas data frame, but I can't seem to get the specific spans printed with their values.
What I've tried:
import bs4
from bs4 import BeautifulSoup
soup = BeautifulSoup(..., 'lxml')
for title_tag in soup.find_all('span', class_='text-muted'):
    # get the last sibling
    *_, value_tag = title_tag.next_siblings
    title = title_tag.text.strip()
    if isinstance(value_tag, bs4.element.Tag):
        value = value_tag.text.strip()
    else:  # it's a navigable string element
        value = value_tag.strip()
    print(title, value)
output:
File Number "A-21-897274"
Location "Ohio"
Operations_Manager "Joanna"
Date "07/01/2022"
Type "Transfer"
Status "Open"
ETC "ETC"
ETC "ETC"
This prints everything I need, but it also prints hundreds of other values I don't want or need.

You can pass a function to soup.find_all to select only the wanted elements, and then use .find_next_sibling() to select the value. For example:
from bs4 import BeautifulSoup
html_doc = """
<div class="col-md-4">
<p>
<span class="text-muted">File Number</span><br>
A-21-897274
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Location</span><br>
Ohio
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Date</span><br>
07/01/2022
</p>
</div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
def correct_tag(tag):
    return tag.name == "span" and tag.get_text(strip=True) in {
        "File Number",
        "Location",
        "Date",
    }
for t in soup.find_all(correct_tag):
    print(f"{t.text}: {t.find_next_sibling(text=True).strip()}")
Prints:
File Number: A-21-897274
Location: Ohio
Date: 07/01/2022
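Since the goal was a pandas data frame, the same filtering can collect the pairs into a dict instead of printing them. The following is a sketch under assumptions: the helper name first_text_after and the extra "Status" entry in the sample HTML are made up for illustration, and the value is assumed to be the first non-empty text node that follows the title span.

```python
from bs4 import BeautifulSoup, NavigableString

html_doc = """
<div class="row">
<div class="col-md-4"><p><span class="text-muted">File Number</span><br>
A-21-897274
</p></div>
<div class="col-md-4"><p><span class="text-muted">Location</span><br>
Ohio
</p></div>
<div class="col-md-4"><p><span class="text-muted">Status</span><br>
Open
</p></div>
<div class="col-md-4"><p><span class="text-muted">Date</span><br>
07/01/2022
</p></div>
</div>
"""

WANTED = {"File Number", "Location", "Date"}


def first_text_after(tag):
    """Return the first non-empty text node among the siblings after `tag`."""
    for sibling in tag.next_siblings:
        if isinstance(sibling, NavigableString) and sibling.strip():
            return sibling.strip()
    return None


soup = BeautifulSoup(html_doc, "html.parser")

record = {}
for span in soup.find_all("span", class_="text-muted"):
    title = span.get_text(strip=True)
    if title in WANTED:  # skip every title that was not asked for
        record[title] = first_text_after(span)

print(record)
```

A list of such dicts, one per scraped page, can then be handed to pandas.DataFrame to build the table.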

How do I scrape data from a tag belonging to the same label and the same class? BeautifulSoup

I have two tags with the same tag name and the same name attribute.
Here is my code:
first_movie.find('p',{'class' : 'sort-num_votes-visible'})
Here is my output
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span data-value="978272" name="nv">978,272</span>
<span class="ghost">|</span> <span class="text-muted">Gross:</span>
<span data-value="858,373,000" name="nv">$858.37M</span>
</p>
I reach the span tag with this code:
first_movie.find('span', {'name':'nv',"data-value": True})
978272 --> output
But I want to reach the other value named nv ($858.37M).
My code only gets the first value (978,272) because both tags have the same name attribute (name = nv).

You're close.
Try using find_all and then grab the last element.
For example:
from bs4 import BeautifulSoup
html_sample = '''
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span data-value="978272" name="nv">978,272</span>
<span class="ghost">|</span> <span class="text-muted">Gross:</span>
<span data-value="858,373,000" name="nv">$858.37M</span>
</p>
'''
soup = (
    BeautifulSoup(html_sample, "lxml")
    .find_all("span", {'name':'nv',"data-value": True})
)
print(soup[-1].getText())
Output:
$858.37M

If you collect all the spans in the p tag, you can treat them as a list and use an index to reach the last span.
movies = soup.find('p',{'class' : 'sort-num_votes-visible'})
my_movie = movies.find_all('span')
my_span = my_movie[-1].text

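Both answers above depend on the position of the span. If the order of the fields could change, one alternative is to look the value up by its label. This is only a sketch: the helper name value_for is hypothetical, and it assumes each label span ("Votes:", "Gross:") is followed by a sibling span carrying name="nv".

```python
from bs4 import BeautifulSoup

html_sample = """
<p class="sort-num_votes-visible">
<span class="text-muted">Votes:</span>
<span data-value="978272" name="nv">978,272</span>
<span class="ghost">|</span> <span class="text-muted">Gross:</span>
<span data-value="858,373,000" name="nv">$858.37M</span>
</p>
"""

soup = BeautifulSoup(html_sample, "html.parser")


def value_for(label):
    """Return the text of the name="nv" span that follows the given label."""
    label_tag = soup.find("span", class_="text-muted", string=label)
    if label_tag is None:
        return None
    # `name` is also find()'s first parameter, so the attribute goes in attrs.
    value_tag = label_tag.find_next_sibling("span", attrs={"name": "nv"})
    return value_tag.get_text(strip=True) if value_tag else None


print(value_for("Gross:"))
print(value_for("Votes:"))
```

A movie without a "Gross:" label would yield None here instead of silently returning the vote count.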
Extract html block based on tag, class and string content

I am really new to bs4 and I would like to get specific content from an HTML page.
When I try the following code, I get many results with the same tag and class, so I need to filter further. The block I am interested in contains a known string. Is there a way to filter by content as well? Any contribution is appreciated.
import requests
from bs4 import BeautifulSoup
html_doc = requests.get('https://www.blockchain.com/bch/address/qqe2tae7hfga2zj5jj8mtjsgznjpy5rvyglew4cy8m')
soup = BeautifulSoup(html_doc.content, 'html.parser')
print(soup.find_all('span', class_='sc-1ryi78w-0 gCzMgE sc-16b9dsl-1 kUAhZx u3ufsr-0 fGQJzg'))
Edit:
I should add that the content looks like the following. There is a string for which I want to extract the value, but the value is in the next tag. Here I want to extract 3.79019103, which sits under the string 'Final Balance'.
Total Sent
</span>
</div>
</div>
<div class="sc-8sty72-0 kcFwUU">
<span class="sc-1ryi78w-0 gCzMgE sc-16b9dsl-1 kUAhZx u3ufsr-0 fGQJzg" opacity="1">
13794.11698089 BCH
</span>
</div>
</div>
<div class="sc-1enh6xt-0 jqiNji">
<div class="sc-8sty72-0 kcFwUU">
<div>
<span class="sc-1ryi78w-0 gCzMgE sc-16b9dsl-1 kUAhZx sc-1n72lkw-0 lhmHll" opacity="1">
Final Balance
</span>
</div>
</div>
<div class="sc-8sty72-0 kcFwUU">
<span class="sc-1ryi78w-0 gCzMgE sc-16b9dsl-1 kUAhZx u3ufsr-0 fGQJzg" opacity="1">
3.79019103 BCH
</span>
</div>
</div>
</div>
</div>
</div>

For finding the Final Balance tag:
final_balance_tag = next(x for x in soup.find_all('span') if 'Final Balance' in x.text)
With this tag you may just jump to the next span tag.
final_balance_tag.findNext('span')
Which gives
<span class="sc-1ryi78w-0 gCzMgE sc-16b9dsl-1 kUAhZx u3ufsr-0 fGQJzg" opacity="1">
3.79019103 BCH
</span>

Search for the string Final Balance using the text=<your string> parameter.
Get the next text node using find_next(), which returns the first match.
Use a list comprehension to keep only the characters for which isdigit() is true.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.blockchain.com/bch/address/qqe2tae7hfga2zj5jj8mtjsgznjpy5rvyglew4cy8m'
soup = BeautifulSoup(requests.get(URL).content, 'html.parser')
price = soup.find(text=lambda t: "Final" in t).find_next(text=True)
print("".join([t for t in price if t.isdigit()]))
Output (currently):
000000000
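Since the live page can change, the label-then-next-span approach is easier to check against a static copy of the markup. The sketch below assumes a trimmed version of the HTML from the question; the class names "label" and "value" are placeholders rather than the site's real classes, and the code does not rely on them. Splitting the text on whitespace keeps the decimal point, which a digit-only filter would drop.

```python
from bs4 import BeautifulSoup

html_doc = """
<div>
<div><div><span class="label" opacity="1">
Total Sent
</span></div></div>
<div><span class="value" opacity="1">
13794.11698089 BCH
</span></div>
</div>
<div>
<div><div><span class="label" opacity="1">
Final Balance
</span></div></div>
<div><span class="value" opacity="1">
3.79019103 BCH
</span></div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Find the span whose text contains the label, whatever its class is.
label = soup.find("span", string=lambda s: s is not None and "Final Balance" in s)

# The value lives in the next span in document order.
value_tag = label.find_next("span")
amount, unit = value_tag.get_text(strip=True).split()
print(amount, unit)
```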

Python BeautifulSoup get data from span tag

Please have a look at following html code:
<section class = "products">
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="5999"> 5,999</span> </span>
<span class="price -old ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="9999"> 9,999</span> </span>
</span>
</section>
In the products section, there are 40 such code blocks containing item prices. Not all products have an old price, but every product has a current price. When I try to access the item prices, I also get the old prices, so I end up with 69 prices in total when there should be 40. I am missing something, but since I am new to this field I couldn't figure it out. Could someone please help? Thanks.

You can use a CSS selector to match the exact class name. For example, here, you can use span[class="price "] as the selector, and it won't match the old prices.
html = '''
<section class = "products">
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="5999"> 5,999</span>
</span>
<span class="price -old ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="9999"> 9,999</span>
</span>
</span>
</section>'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for price in soup.select('span[class="price "]'):
    print(price.get_text(' ', strip=True))
Output:
Rs. 5,999
Or, you could also use a custom function to match the class name.
for price in soup.find_all('span', class_=lambda c: c == 'price '):
    print(price.get_text(' ', strip=True))
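Matching the exact string "price " relies on the trailing space in the class attribute, and how that whitespace is preserved may depend on the bs4 version. A possibly more robust variant is to match on class tokens and skip any box that also carries the -old token. This is a sketch: the second product in the sample HTML is invented to show a product with no old price.

```python
from bs4 import BeautifulSoup

html = """
<section class="products">
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="5999"> 5,999</span> </span>
<span class="price -old ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="9999"> 9,999</span> </span>
</span>
<span class="price-box ri">
<span class="price ">
<span data-currency-iso="PKR">Rs.</span>
<span dir="ltr" data-price="1200"> 1,200</span> </span>
</span>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

current_prices = []
# class_="price" matches any span whose class list contains the token "price",
# which includes the old-price boxes, so those are filtered out explicitly.
for box in soup.find_all("span", class_="price"):
    if "-old" in box.get("class", []):
        continue
    amount = box.find("span", attrs={"data-price": True})
    current_prices.append(int(amount["data-price"]))

print(current_prices)
```

Reading the data-price attribute also avoids having to strip the thousands separator from the displayed text.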

Accessing untagged text using beautifulsoup

I am using python and beautifulsoup4 to extract some address information.
More specifically, I require assistance when retrieving non-US based zip codes.
Consider the following html data of a US based company: (already a soup object)
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">999 State St Ste 100</span><br/>
<span class="locality">Salt Lake City,</span>
<span class="region">UT</span>
<span class="zip">84114-0002,</span>
<br/><span class="country-name">United States</span>
</p>
<p>
<span class="tel">
<strong class="type">Phone: </strong>+1-000-000-000
</span><br/>
</p>
<p class="companyURL"><a class="url ext" href="http://www.website.com" target="_blank">http://www.website.com</a></p>
</div>
</ul>
</div>
I can extract the zipcode (84114-0002) by using the following piece of python code:
class CompanyDescription:
    def __init__(self, page):
        self.data = page.find('div', attrs={'id': 'companyDescription'})

    def address(self):
        #TODO: Also retrieve the Zipcode for UK and German based addresses - tricky!
        address = {'street-address': '', 'locality': '', 'region': '', 'zip': '', 'country-name': ''}
        for key in address:
            try:
                adr = self.data.find('p', attrs={'id': 'adr'})
                if adr.find('span', attrs={'class': key}) is None:
                    address[key] = ''
                else:
                    address[key] = adr.find('span', attrs={'class': key}).text.split(',')[0]
                # Attempting to grab another zip code value
                if address['zip'] == '':
                    pass
            except:
                # We should return a dictionary with "" as key adr
                return address
        return address
You can see that I need some advice on the line if address['zip'] == '':
These two soup object examples are giving me trouble. In the one below, I would like to retrieve EC4N 4SA:
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
EC4N 4SA
<span class="region">London</span>
<br/><span class="country-name">England</span>
</p>
<p>
</p>
<p class="companyURL"><a class="url ext" href="http://www.website.com.com" target="_blank">http://www.website.com.com</a></p>
</div>
<p><strong>Line of Business</strong> <br/>Management services, nsk</p>
</div>
and in the one below, I am interested in getting 71364:
<div class="compContent curvedBottom" id="companyDescription">
<div class="vcard clearfix">
<p id="adr">
<span class="street-address">Alfred-Kärcher-Str. 100</span><br/>
71364
<span class="locality">Winnenden</span>
<span class="region">Baden-Württemberg</span>
<br/><span class="country-name">Germany</span>
</p>
<p>
<span class="tel">
<strong class="type">Phone: </strong>+00-1234567
</span><br/>
<span class="tel"><strong class="type">Fax: </strong>+00-1234567</span>
</p>
</div>
</div>
I am running this program over approximately 68,000 accounts, of which 28,000 are non-US based. I have pulled out only two examples for which I know the current method is not bulletproof. There may be other address formats where this script does not work as expected, but I believe figuring out UK and German based accounts will help tremendously.
Thanks in advance

Because the zip code is plain text sitting directly inside <p>, without a tag of its own, you can use
find_all(text=True, recursive=False)
to get only the text nodes that are direct children of <p>, skipping text in nested tags (<span>). This gives a list containing your text plus some \n and spaces, so you can use join() to create one string and strip() to remove the surrounding \n and spaces.
data = '''<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
EC4N 4SA
<span class="region">London</span>
<br/><span class="country-name">England</span>
</p>'''
from bs4 import BeautifulSoup as BS
soup = BS(data, 'html.parser').find('p')
print(''.join(soup.find_all(text=True, recursive=False)).strip())
result: EC4N 4SA
The same works for the second HTML:
data = '''<p id="adr">
<span class="street-address">Alfred-Kärcher-Str. 100</span><br/>
71364
<span class="locality">Winnenden</span>
<span class="region">Baden-Württemberg</span>
<br/><span class="country-name">Germany</span>
</p>'''
from bs4 import BeautifulSoup as BS
soup = BS(data, 'html.parser').find('p')
print(''.join(soup.find_all(text=True, recursive=False)).strip())
result: 71364
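The two cases can be folded into one helper that prefers the zip span when it exists and otherwise falls back to the untagged text. This is a sketch rather than a drop-in fix: the function name extract_zip is hypothetical, string=True is the newer spelling of text=True, and it assumes the postal code is the only loose text inside the <p>.

```python
from bs4 import BeautifulSoup


def extract_zip(adr):
    """Return the postal code from a <p id="adr"> tag, or '' if none is found."""
    zip_span = adr.find("span", class_="zip")
    if zip_span is not None:
        # US-style markup: drop the trailing comma, e.g. "84114-0002,".
        return zip_span.get_text(strip=True).rstrip(",")
    # Fallback: text nodes that are direct children of <p>, not inside a span.
    loose_text = adr.find_all(string=True, recursive=False)
    return " ".join("".join(loose_text).split())


us = '''<p id="adr">
<span class="street-address">999 State St Ste 100</span><br/>
<span class="locality">Salt Lake City,</span>
<span class="region">UT</span>
<span class="zip">84114-0002,</span>
<br/><span class="country-name">United States</span>
</p>'''

uk = '''<p id="adr">
<span class="street-address">Albert Buildings</span><br/>
<span class="extended-address">00 Queen Victoria Street</span>
<span class="locality">London</span>
EC4N 4SA
<span class="region">London</span>
<br/><span class="country-name">England</span>
</p>'''

de = '''<p id="adr">
<span class="street-address">Alfred-Kärcher-Str. 100</span><br/>
71364
<span class="locality">Winnenden</span>
<span class="region">Baden-Württemberg</span>
<br/><span class="country-name">Germany</span>
</p>'''

results = [
    extract_zip(BeautifulSoup(markup, "html.parser").find("p", id="adr"))
    for markup in (us, uk, de)
]
print(results)
```

Inside the address() method from the question, a call like this could replace the pass under if address['zip'] == '':, with adr being the tag already looked up there.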
