Using BeautifulSoup to find text in HTML - Python

This is my first time using BeautifulSoup as a scraping tool, and I am working through each step slowly.
I used soup.find_all("div", class_="product-box__inner") to get the list of elements I want, but I am stuck on this particular part. My question is below.
Here is the HTML; my target is "$0". I have tried
element.find("span", title=re.compile("$")), and I can't use element.select("dt > dd > span > span") because there are multiple elements with the same tag structure that I don't need at all. Is there a way to target the span with data-fees-annual-value="" so that .text works?
<div class="product-box__features-item">
<dt class="f-body-3 product-box__features-label">Annual fee</dt>
<dd class="f-title-5 product-box__features-text u-margin-0">
<span>
<span data-fees-annual-value="">$0</span>
</span>
</dd>
</div>

You are close with CSS selectors; they can be made more specific by referencing the data-fees-annual-value attribute directly:
soup.select_one('span[data-fees-annual-value]').text
Example
from bs4 import BeautifulSoup
html="""
<div class="product-box__features-item">
<dt class="f-body-3 product-box__features-label">Annual fee</dt>
<dd class="f-title-5 product-box__features-text u-margin-0">
<span>
<span data-fees-annual-value="">$0</span>
</span>
</dd>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one('span[data-fees-annual-value]').text)
Output
$0

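If you want to find the element by its text, use string instead of title, and escape the dollar sign: an unescaped $ in a regex is the end-of-string anchor, so it matches any string:
element.find("span", string=re.compile(r'\$'))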
Output:
<span data-fees-annual-value="">$0</span>
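As a minimal sketch of how this could fit into the loop over the product boxes from the question: the selector is applied to each box separately, and a box without the attribute gives None instead of raising an AttributeError. The sample HTML, the second fee of $95 and the names box and fees are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: three product boxes, the last one without a fee span.
html = """
<div class="product-box__inner">
  <dd><span><span data-fees-annual-value="">$0</span></span></dd>
</div>
<div class="product-box__inner">
  <dd><span><span data-fees-annual-value="">$95</span></span></dd>
</div>
<div class="product-box__inner">
  <dd><span>No fee information</span></dd>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

fees = []
for box in soup.find_all("div", class_="product-box__inner"):
    # Match on the presence of the attribute, whatever its value is.
    span = box.select_one("span[data-fees-annual-value]")
    fees.append(span.get_text(strip=True) if span else None)

print(fees)  # ['$0', '$95', None]
```

Checking for None before reading the text keeps one incomplete box from stopping the whole scrape.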

Related

Extract Parent Text without Children Text; Parsing HTML

I have this small bit of a soup tag element that I pulled using Selenium & BeautifulSoup.
<footer>
<p class="tags environment-tags">Environment:
<span class="tag environment-tag">Desert</span>
</p>
<p class="source monster-source">Basic Rules
<span class="page-number">, pg. 334</span>
</p>
</footer>
I am trying to grab the text from just the p elements, but every time I try, it grabs the span text as well. So far this is what I have tried:
for p in Environment.findAll('p'):
    print(p.text)
I have also tried to extract the information using .extract() but that doesn't seem to work for me.
You can use .contents and access the 0th element:
for tag in soup.find_all("p"):
    print(tag.contents[0].strip())
Output:
Environment:
Basic Rules
Or, building on your attempt, you can remove the <span> elements with .extract():
for tag in soup.select("p span"):
    tag.extract()
print(soup.prettify())
Output:
<footer>
<p class="tags environment-tags">
Environment:
</p>
<p class="source monster-source">
Basic Rules
</p>
</footer>
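A third option, sketched here under the assumption that you want to leave the tree untouched: ask each p only for its direct text nodes with find_all(string=True, recursive=False). Unlike .contents[0], this also picks up text that comes after a child tag. The helper name own_text is made up for this example.

```python
from bs4 import BeautifulSoup

html = """
<footer>
<p class="tags environment-tags">Environment:
<span class="tag environment-tag">Desert</span>
</p>
<p class="source monster-source">Basic Rules
<span class="page-number">, pg. 334</span>
</p>
</footer>
"""

soup = BeautifulSoup(html, "html.parser")


def own_text(tag):
    """Return only the text that is a direct child of tag (hypothetical helper)."""
    # recursive=False limits the search to direct children, so text inside <span> is skipped.
    return "".join(tag.find_all(string=True, recursive=False)).strip()


texts = [own_text(p) for p in soup.find_all("p")]
print(texts)  # ['Environment:', 'Basic Rules']
```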

xpath for a content text from node

I'm doing web scraping for the first time, using Scrapy, and I'm trying to get some prices from a website. The problem is that I don't know how to get the value, because it is inside the node's content attribute. This is also my first time with XPath, so I'm a little confused. Let me give an example:
<span class="list d-block">
<span class="value" content="1250">
<span class="sr-only">
Precio reducido de
</span>
<span class="price-original">
<span class="">
$1.250
</span>
(Normal)
</span>
<span class="sr-only">
(Oferta)
</span>
</span>
</span>
I need to get the content attribute, in this case "1250", from the span with @class="value".
Any help will be great!
As I understand it, you want the value of the content attribute. Here is the XPath:
'//span[@class="value"]/@content'
On the XML that you posted, this XPath should work:
string(//span[@class='value']/@content)
Please find this tutorial for details on xpath.
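To try the expression outside of a spider, here is a small sketch that uses lxml (an assumption; the question uses Scrapy) on a shortened copy of the posted markup. In a Scrapy callback the same expression would normally be passed to response.xpath(...).get().

```python
from lxml import html as lxml_html

# Shortened copy of the markup from the question.
snippet = """
<span class="list d-block">
  <span class="value" content="1250">
    <span class="price-original">
      <span class="">$1.250</span>
      (Normal)
    </span>
  </span>
</span>
"""

tree = lxml_html.fromstring(snippet)

# /@content selects the attribute itself, so the result is a list of strings.
values = tree.xpath('//span[@class="value"]/@content')
print(values)  # ['1250']
```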

How do I select an image by XPath without a class name using Selenium in Python?

How do I select an image by XPath when it has no class name? The HTML looks like this:
<img alt="" class src="https://images.craigslist.org/00J0J_i9BI6mN6rKP_300x300.jpg">
If I right-click and copy the XPath, it gives me //*[@id="sortable-results"]/ul/li[1]/a/img, but when I use it in my code I get an error.
In my code I use it like this:
src = driver.find_elements_by_xpath('/li[@class="result-row"]/a[@class="result-image gallery"]/img[@class=""]/@src')
but it returns [] when I print(src).
Full HTML of the list item:
<li class="result-row" data-pid="7017735595">
<a href="https://vancouver.craigslist.org/van/ele/d/vancouver-sealed-brand-new-in-box/7017735595.html" class="result-image gallery" data-ids="1:00J0J_i9BI6mN6rKP"><img alt="" class="" src="https://images.craigslist.org/00J0J_i9BI6mN6rKP_300x300.jpg">
<span class="result-price">$35</span>
</a>
<p class="result-info">
<span class="icon icon-star" role="button" title="save this post in your favorites list">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2019-11-11 00:52" title="Mon 11 Nov 12:52:25 AM">Nov 11</time>
Sealed - Brand New in Box - Google Home Mini
<span class="result-meta">
<span class="result-price">$35</span>
<span class="result-hood"> (Vancouver)</span>
<span class="result-tags">
<span class="pictag">pic</span>
</span>
<span class="banish icon icon-trash" role="button">
<span class="screen-reader-text">hide this posting</span>
</span>
<span class="unbanish icon icon-trash red" role="button" aria-hidden="true"></span>
<a href="#" class="restore-link">
<span class="restore-narrow-text">restore</span>
<span class="restore-wide-text">restore this posting</span>
</a>
</span>
</p>
</li>
The XPath is close. You need to use // at the beginning of the path and remove the /@src, because a Selenium locator has to resolve to elements, not attributes:
//li[@class="result-row"]/a[@class="result-image gallery"]/img[@class=""]
If you want to make sure the element has a src attribute, write it like this:
//li[@class="result-row"]/a[@class="result-image gallery"]/img[@class=""][@src]
To get the src attribute, use get_attribute('src'):
src = driver.find_elements_by_xpath('//li[@class="result-row"]/a[@class="result-image gallery"]/img[@class=""]')[0].get_attribute('src')
Note that find_elements returns a list; use an index to get the first element.
If you want to use class="result-info" to locate the element, you can do:
elements = driver.find_elements_by_xpath('//p[@class="result-info"]/../a[@class="result-image gallery"]/img[@class=""]')
for element in elements:
    src = element.get_attribute('src')
Actually, the XPath was copied correctly;
you just used it the wrong way in the code that fetches the element.
If you want the specific image, use:
image = driver.find_element_by_xpath('//*[@id="sortable-results"]/ul/li[1]/a/img')
Or, if you want a list of all images matching the same kind of XPath, use:
images = driver.find_elements_by_xpath('//*[@id="sortable-results"]/ul/li/a/img')
(i.e. remove the index from li, or from any other element you want to generalise, and use find_elements; use find_element to fetch a single specific element)
To get the 'src' attribute, use the get_attribute method.
For case 1:
website = image.get_attribute('src')
For case 2:
website = images[0].get_attribute('src')

Exclude hidden tags while scraping using bs4

I have a website that has plenty of hidden tags in the html.
I have pasted the source code below.
The challenge is that there are 2 types of hidden tags:
1. Ones with style="display:none"
2. Ones hidden by a list of styles declared under every td tag.
That list changes with every td tag.
For the example below, it has the following styles:
.hLcj{display:none}
.J9pE{display:inline}
.kUC-{display:none}
.Dzkb{display:inline}
.mXJU{display:none}
.DZqk{display:inline}
.rr9s{display:none}
.nGF_{display:inline}
So the elements with class hLcj, kUC-, mXJU, rr9s, etc. are hidden elements.
I want to extract the text of entire tr but exclude these hidden tags.
I have been scratching my head for hours and still no success.
Any help would be much appreciated. Thanks
I am using bs4 and python 2.7
<td class="leftborder timestamp" rel="1416853322">
<td>
<span>
<style>
.hLcj{display:none}
.J9pE{display:inline}
.kUC-{display:none}
.Dzkb{display:inline}
.mXJU{display:none}
.DZqk{display:inline}
.rr9s{display:none}
.nGF_{display:inline}
</style>
<span class="rr9s">35</span>
<span></span>
<div style="display:none">121</div>
<span class="226">199</span>
.
<span class="rr9s">116</span>
<div style="display:none">116</div>
<span></span>
<span class="Dzkb">200</span>
<span style="display: inline">.</span>
<span style="display:none">86</span>
<span class="kUC-">86</span>
<span></span>
120
<span class="kUC-">134</span>
<div style="display:none">134</div>
<span class="mXJU">151</span>
<div style="display:none">151</div>
<span class="rr9s">154</span>
<span class="Dzkb">.</span>
<span class="119">36</span>
<span class="kUC-">157</span>
<div style="display:none">157</div>
<span class="rr9s">249</span>
<div style="display:none">249</div>
</span>
</td>
<td> 7808</td>
Using Selenium would make the task much easier, since it knows which elements are hidden and which aren't.
But, anyway, here's some basic code that you would probably need to improve. The idea is to parse the style tag to get the list of classes to exclude, keep a list of tags to exclude, and check the style attribute of each child element in tr:
import re
from bs4 import BeautifulSoup
data = """ your html here """
soup = BeautifulSoup(data, "html.parser")
tr = soup.tr
# get classes to exclude: those declared with display:none in the <style> block
classes_to_exclude = []
for line in tr.style.text.split():
    match = re.match(r'^\.(.*?)\{display:none\}', line)
    if match:
        classes_to_exclude.append(match.group(1))
tags_to_exclude = ['style', 'script']
texts = []
for item in tr.find_all(text=True):
    # skip text that belongs to non-content tags
    if item.parent.name in tags_to_exclude:
        continue
    # skip text whose parent has one of the hidden classes
    class_ = item.parent.get('class')
    if class_ and class_[0] in classes_to_exclude:
        continue
    # skip text whose parent is hidden with an inline style
    if item.parent.get('style') == 'display:none':
        continue
    texts.append(item)
# texts is a list, so strip each text node before joining
print(''.join(text.strip() for text in texts))
Prints:
199.200.120.36
Also see:
BeautifulSoup Grab Visible Webpage Text
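Since the code above has a placeholder for the HTML, here is a self-contained sketch of the same idea that can be run as-is. The sample markup, the class names a1 and b2, and the helper name visible_text are made up. It is a little more tolerant (spaces in the style attribute, several classes on one element), but like the original it only looks at the direct parent of each text node, not at hidden ancestors further up.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened sample in the same shape as the question's markup.
data = """
<table><tr><td><span>
<style>
.a1{display:none}
.b2{display:inline}
</style>
<span class="a1">35</span>
<span class="b2">199</span>
<div style="display:none">121</div>
<span style="display: inline">.</span>
<span class="a1">86</span>
200
</span></td></tr></table>
"""


def visible_text(cell):
    """Join the text nodes of cell whose direct parent is not hidden."""
    # Collect the class names declared with display:none in the embedded CSS.
    css = "".join(str(style.string or "") for style in cell.find_all("style"))
    hidden = set(re.findall(r"\.([\w-]+)\s*\{\s*display\s*:\s*none", css))

    parts = []
    for node in cell.find_all(string=True):
        parent = node.parent
        if parent.name in ("style", "script"):
            continue
        if hidden.intersection(parent.get("class", [])):
            continue
        if "display:none" in parent.get("style", "").replace(" ", ""):
            continue
        parts.append(node.strip())
    return "".join(parts)


soup = BeautifulSoup(data, "html.parser")
cell = soup.tr
print(visible_text(cell))  # 199.200
```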

Extracting HTML data fields with Python

Please forgive me for my lack of knowledge, but given HTML in the following format, what is the best way to extract the individual data fields? Please keep in mind that more often than not some, or all, of them will be NULL in which case we'll keep them at NULL.
<div class="profile-section" id="a-bit-more-about">
<dl>
<dt>Name:</dt>
<dd><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></dd>
</dl>
<!-- <span class="RealName">/ <span class="fn n"><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></span></span> -->
<dl>
<dt>Joined:</dt>
<dd>September 1910</dd>
</dl>
<div class="sep"></div>
<dl>
<dt>Hometown:</dt>
<dd>Quiet Rest Maximum Security Twilight Home</dd>
</dl>
<dl>
<dt>Currently:</dt>
<dd><span class="adr"><span class="locality">They won't tell me</span>, <span class="country-name">Zimbobwe</span></span></dd>
</dl>
<div class="sep"></div>
You want an HTML parser. I recommend Beautiful Soup or lxml.
Use the third-party modules Beautiful Soup or lxml, or the built-in module html.parser. For example:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<html><body><a>bbb</a></body></html>', 'html.parser')
soup.find('a')
Or, if you like, you can use a regex for a small target.
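For the profile markup in the question, a rough sketch with Beautiful Soup could walk the dl blocks and pair each dt label with its dd value, storing None when a value is missing or empty. The shortened sample (with an empty Hometown) and the names fields, label and value are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample based on the question; the empty Hometown is hypothetical.
html = """
<div class="profile-section" id="a-bit-more-about">
<dl>
<dt>Name:</dt>
<dd><span class="given-name">Clem</span> <span class="family-name">Kadiddlehopper</span></dd>
</dl>
<dl>
<dt>Joined:</dt>
<dd>September 1910</dd>
</dl>
<div class="sep"></div>
<dl>
<dt>Hometown:</dt>
<dd></dd>
</dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
section = soup.find("div", id="a-bit-more-about")

fields = {}
for dl in section.find_all("dl"):
    dt = dl.find("dt")
    dd = dl.find("dd")
    if dt is None:
        continue
    label = dt.get_text(strip=True).rstrip(":")
    # Join nested spans with a single space; empty values become None.
    value = dd.get_text(" ", strip=True) if dd is not None else ""
    fields[label] = value or None

print(fields)
```

Fields that are absent from the page simply never appear in the dictionary, so fields.get("Currently") returns None for them as well.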
