How to scrape from a span subclass using scrapy


<span class="price-box"> <span class="price"><span data-currency-iso="PKR">Rs.</span> <span dir="ltr" data-price="16999"> 16,999</span> </span> <span class="price -old "><span data-currency-iso="PKR">Rs.</span> <span dir="ltr" data-price="50000"> 50,000</span> </span> </span>
Hello. I need some help extracting the data-price attribute from the span with dir="ltr". I cannot work out how to extract it using Scrapy.

It is pretty simple (assuming you get this HTML with a response in spider callback):
>>> response.css('span[dir=ltr]::attr(data-price)').extract()
['16999', '50000']
I would recommend reading about Scrapy Selectors.

As an alternative to @Stasdeep's answer, you could use XPath:
response.xpath('//span[@dir="ltr"]/@data-price').extract()
// -> any descendant span, no matter how deep it is
span[@dir="ltr"] -> span whose dir attribute equals "ltr"
@data-price -> the attribute you want, on that same span
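
The same predicate can be tried without a running spider. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, which works here because the snippet happens to be well-formed XML. ElementTree supports only a subset of XPath and cannot select attribute nodes, so the attribute is read from each matched element instead.

```python
import xml.etree.ElementTree as ET

# The markup from the question, split across lines for readability.
snippet = (
    '<span class="price-box">'
    '<span class="price"><span data-currency-iso="PKR">Rs.</span> '
    '<span dir="ltr" data-price="16999"> 16,999</span></span>'
    '<span class="price -old "><span data-currency-iso="PKR">Rs.</span> '
    '<span dir="ltr" data-price="50000"> 50,000</span></span>'
    '</span>'
)
root = ET.fromstring(snippet)

# ".//span[@dir='ltr']" -> every descendant span whose dir attribute is "ltr"
matches = root.findall(".//span[@dir='ltr']")

# No "/@data-price" step in ElementTree, so read the attribute per element.
prices = [span.get("data-price") for span in matches]
print(prices)  # ['16999', '50000']
```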

Related

Using BeautifulSoup to find text in HTML

This is my first time using BeautifulSoup as a scraping tool, and I am working through each step slowly.
I used soup.find_all("div", class_="product-box__inner") to find the list of elements I want, but I cannot work out this particular part. My question is below.
Here is the HTML. My target is "$0", and I have tried
element.find("span", title=re.compile("$")). I can't use element.select("dt > dd > span > span") because there are several elements with the same tag structure that I don't need at all. Is there a way to target the span with data-fees-annual-value="" so that .text works?
<div class="product-box__features-item">
<dt class="f-body-3 product-box__features-label">Annual fee</dt>
<dd class="f-title-5 product-box__features-text u-margin-0">
<span>
<span data-fees-annual-value="">$0</span>
</span>
</dd>
</div>

You are close to your goal. CSS selectors can be made more specific and can reference the data-fees-annual-value attribute directly:
soup.select_one('span[data-fees-annual-value]').text
Example
from bs4 import BeautifulSoup
html="""
<div class="product-box__features-item">
<dt class="f-body-3 product-box__features-label">Annual fee</dt>
<dd class="f-title-5 product-box__features-text u-margin-0">
<span>
<span data-fees-annual-value="">$0</span>
</span>
</dd>
</div>
"""
soup=BeautifulSoup(html,"html.parser")
soup.select_one('span[data-fees-annual-value]').text
Output
$0

If you want to find the element by its text, use string instead of title, and escape the dollar sign so that it is matched literally:
element.find("span", string=re.compile(r'\$'))
Output:
<span data-fees-annual-value="">$0</span>
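
A note on the pattern itself: in a regular expression an unescaped $ is the end-of-string anchor, not a dollar sign, so it matches any string at all. The short sketch below uses only the re module and a few made-up sample strings to show the difference; it does not involve BeautifulSoup.

```python
import re

# Hypothetical sample strings standing in for the text of several spans.
texts = ["$0", "Annual fee", ""]

# Unescaped "$" is the end-of-string anchor, so every string matches.
anchor = re.compile("$")
# Escaped "\$" matches only a literal dollar sign.
literal = re.compile(r"\$")

anchor_hits = [t for t in texts if anchor.search(t)]
literal_hits = [t for t in texts if literal.search(t)]

print(anchor_hits)   # ['$0', 'Annual fee', '']
print(literal_hits)  # ['$0']
```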

xpath for a content text from node

I'm doing web scraping for the first time, using Scrapy to get some prices from a website. I don't know how to get the price because it is inside the node's content attribute, and this is my first time with XPath, so I'm a little confused. Let me give an example:
<span class="list d-block">
<span class="value" content="1250">
<span class="sr-only">
Precio reducido de
</span>
<span class="price-original">
<span class="">
$1.250
</span>
(Normal)
</span>
<span class="sr-only">
(Oferta)
</span>
</span>
</span>
I need to get the content attribute ("1250" in this case) from the span with @class="value".
Any help would be great!

As I understand it, you want the value of the content attribute. Here is the XPath:
'//span[@class="value"]/@content'

On the XML that you posted, this XPath should work:
string(//span[@class='value']/@content)
Please see this tutorial for details on XPath.

how do I select xpath image without a class name using selenium in python?

How do I select an image by XPath when it has no class name? The HTML looks like this:
<img alt="" class src="https://images.craigslist.org/00J0J_i9BI6mN6rKP_300x300.jpg">
If I right-click and copy the XPath, it gives me //*[@id="sortable-results"]/ul/li[1]/a/img, but when I use it in my code I get an error.
In my code I use it like this:
src = driver.find_elements_by_xpath('/li[@class="result-row"]/a[@class="result-image gallery"]/img[@class=""]/@src')
but it returns [] when I print(src).
Full div
<li class="result-row" data-pid="7017735595">
<a href="https://vancouver.craigslist.org/van/ele/d/vancouver-sealed-brand-new-in-box/7017735595.html" class="result-image gallery" data-ids="1:00J0J_i9BI6mN6rKP"><img alt="" class="" src="https://images.craigslist.org/00J0J_i9BI6mN6rKP_300x300.jpg">
<span class="result-price">$35</span>
</a>
<p class="result-info">
<span class="icon icon-star" role="button" title="save this post in your favorites list">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2019-11-11 00:52" title="Mon 11 Nov 12:52:25 AM">Nov 11</time>
Sealed - Brand New in Box - Google Home Mini
<span class="result-meta">
<span class="result-price">$35</span>
<span class="result-hood"> (Vancouver)</span>
<span class="result-tags">
<span class="pictag">pic</span>
</span>
<span class="banish icon icon-trash" role="button">
<span class="screen-reader-text">hide this posting</span>
</span>
<span class="unbanish icon icon-trash red" role="button" aria-hidden="true"></span>
<a href="#" class="restore-link">
<span class="restore-narrow-text">restore</span>
<span class="restore-wide-text">restore this posting</span>
</a>
</span>
</p>
</li>

The XPath is close. You need to use // at the beginning of the path and remove the /@src:
//li[@class="result-row"]/a[@class="result-image gallery"]/img[@class=""]
If you want to make sure the element has a src attribute, add a [@src] predicate:
//li[@class="result-row"]/a[@class="result-image gallery"]/img[@class=""][@src]
To get the src attribute, use get_attribute('src'):
src = driver.find_elements_by_xpath('//li[@class="result-row"]/a[@class="result-image gallery"]/img[@class=""]')[0].get_attribute('src')
Note that find_elements returns a list; use an index to get the first element.
If you want to use class="result-info" to locate the element, you can do:
elements = driver.find_elements_by_xpath('//p[@class="result-info"]/../a[@class="result-image gallery"]/img[@class=""]')
for element in elements:
    src = element.get_attribute('src')
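
The /.. step in the last path climbs from the <p> to its parent <li> before descending to the image. A minimal sketch of that navigation, assuming a simplified and well-formed copy of the listing markup (the <ul> wrapper, the shortened href and the self-closed <img/> are changes made for this example), can be run with the standard library's ElementTree instead of a browser. As with Selenium, the path selects the element and the attribute is read afterwards.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed version of the listing; not the real page source.
snippet = """
<ul>
  <li class="result-row" data-pid="7017735595">
    <a href="#" class="result-image gallery">
      <img alt="" class="" src="https://images.craigslist.org/00J0J_i9BI6mN6rKP_300x300.jpg"/>
    </a>
    <p class="result-info">Sealed - Brand New in Box - Google Home Mini</p>
  </li>
</ul>
"""
root = ET.fromstring(snippet)

# Start at the <p>, step up to its parent <li> with "..", then go down to the image.
path = ".//p[@class='result-info']/../a[@class='result-image gallery']/img"

# The path selects elements; the attribute is read from each one afterwards.
srcs = [img.get("src") for img in root.findall(path)]
print(srcs)
```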

Actually, the XPath was copied correctly.
You used it the wrong way in the code that fetches the element.
If you want the specific image, use:
image = driver.find_element_by_xpath('//*[@id="sortable-results"]/ul/li[1]/a/img')
Or, if you want a list of all images matching the same kind of path, use:
images = driver.find_elements_by_xpath('//*[@id="sortable-results"]/ul/li/a/img')
(i.e. remove the index on li, or on any other element you want to generalise, and use find_elements; use find_element to fetch a single specific element)
To get the src attribute, use the get_attribute method.
For case 1:
website = image.get_attribute('src')
For case 2:
website = images[0].get_attribute('src')

How can I parse html file using python and beautiful soup from html tag under html tag value?

My HTML file contains the same tag (<span class="fna">) multiple times. To tell the occurrences apart I need to look at the enclosing tag: I want the one under <span id="field-value-reporter">.
In Beautiful Soup I can only filter on the tag itself, e.g. soup.find_all("span", {"class": "fna"}). This extracts the data for every <span class="fna">, but I need only the one contained in <span id="field-value-reporter">.
Example html tags:
<div class="value">
<span id="field-value-reporter">
<div class="vcard vcard_287422" >
<a class="email " href="/user_profile?user_id=287422" >
<span class="fna">Chris Pearce (:cpearce)
</span>
</a>
</div>
</span>
</div>
<div class="value">
<span id="field-value-triage_owner">
<div class="vcard vcard_27780" >
<a class="email " href="/user_profile?user_id=27780">
<span class="fna">Justin Dolske [:Dolske]
</span>
</a>
</div>
</span>
</div>

Use soup.select:
soup.select('#field-value-reporter a > span') # spans that are direct children of an <a> inside the element whose id is field-value-reporter
Output:
[<span class="fna">Chris Pearce (:cpearce)</span>]
soup.select uses CSS selectors, which are, in my opinion, much more capable than the default element search that comes with BeautifulSoup. Note that the results are returned as a list containing every match.
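
The idea behind the selector is to narrow the search to one container first and only then look for the repeated tag. The sketch below shows that two-step scoping with the standard library's ElementTree rather than BeautifulSoup, assuming a trimmed, well-formed copy of the markup; the outer <div> wrapper is added here so that the snippet has a single root.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's markup inside an added wrapper element.
snippet = """
<div>
  <div class="value">
    <span id="field-value-reporter">
      <div class="vcard vcard_287422">
        <a class="email " href="/user_profile?user_id=287422">
          <span class="fna">Chris Pearce (:cpearce)</span>
        </a>
      </div>
    </span>
  </div>
  <div class="value">
    <span id="field-value-triage_owner">
      <div class="vcard vcard_27780">
        <a class="email " href="/user_profile?user_id=27780">
          <span class="fna">Justin Dolske [:Dolske]</span>
        </a>
      </div>
    </span>
  </div>
</div>
"""
root = ET.fromstring(snippet)

# Step 1: find the container with the wanted id.
reporter = root.find(".//span[@id='field-value-reporter']")

# Step 2: search for the repeated tag only inside that container.
names = [s.text.strip() for s in reporter.findall(".//span[@class='fna']")]
print(names)  # ['Chris Pearce (:cpearce)']
```

In BeautifulSoup the same two steps can be written as soup.find(id="field-value-reporter").find_all("span", class_="fna").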

How can I get only visible text from some HTML node in Python

How can I get only visible text from some HTML node in Python?
Suppose that I have a node like this:
<span>
<style>.vAnH{display:none}.vsP6{display:inline}</style>
<span class="vAnH">34</span>
<span />
<span style="display: inline">111</span>
<span style="display:none">120</span>
<span class="vAnH">120</span>
<div style="display:none">120</div>
<span class="78">.</span>
<span class="vAnH">100</span>
<div style="display:none">100</div>
161
<span style="display: inline">.</span>
<span class="174">126</span>
<span class="vAnH">159</span>
<div style="display:none">159</div>
<span />
<span class="vsP6">.</span>
<span style="display:none">5</span>
<span class="vAnH">5</span>
<div style="display:none">5</div>
<span style="display:none">73</span>
<span class="vAnH">73</span>
<div style="display:none">73</div>
<span class="221">98</span>
<span style="display:none">194</span>
<div style="display:none">194</div>
</span>
Is there any third-party libraries to do it or should I parse it manually?

There are multiple ways to make a node visible or hidden for the end user in the browser. BeautifulSoup is an HTML parser; it doesn't know whether an element would be shown or not. There was an attempt here, though:
BeautifulSoup Grab Visible Webpage Text
It would not work if, for example, an element is hidden by a CSS rule, but it might work for your use case.
The easiest option would be to switch to Selenium, where .text returns only the visible text of an element:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://domain.com')
element = driver.find_element_by_id('id_of_an_element')
print(element.text)

If you don't want to go the Selenium way, you can get something with BeautifulSoup:
from bs4 import BeautifulSoup
def is_visible_span_or_div(tag, is_parent=False):
    """Check whether the element is a span or a div and is visible.
    The parents are checked recursively, and False is returned
    if one of them is hidden."""
    # load the style attribute, dropping spaces so that both
    # "display: none" and "display:none" are recognised
    style = tag.attrs.get('style', '').replace(' ', '')
    # check that the element is a div or span, unless it is a parent
    if not is_parent and tag.name not in ('div', 'span'):
        return False
    # check whether the element is hidden
    if 'hidden' in style or 'display:none' in style:
        return False
    # make a recursive call to check the parent as well
    parent = tag.parent
    if parent and not is_visible_span_or_div(parent, is_parent=True):
        return False
    # neither the element nor its parent(s) are hidden, so return True
    return True
html = """
<span style="display: none;">I am not visible</span>
<span style="display: inline">I am visible</span>
<div style="display: none;">
<span>I am a visible span inside a hidden div</span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
visible_elements = soup.find_all(is_visible_span_or_div)
print(visible_elements)
Keep in mind that it's not going to reflect exactly how a browser would display or hide the elements, because other factors can decide the visibility of an element (such as width, height, opacity, absolute positioning outside the window...).
Despite that, this script is quite reliable for inline styles because it recursively checks all of the element's parents as well and returns False as soon as it finds a hidden parent.
The only problem I see with this function is that it has quite an overhead, because it has to check all the parents for every element, even if those elements are siblings in the DOM tree. It could easily be optimised for that, but perhaps at the cost of readability.
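
One possible way to avoid re-checking the parents is to walk the tree from the top and stop descending as soon as a hidden element is reached, so each node is visited at most once. The sketch below is an assumption about how that could look, not part of the answer above: it uses the standard library's ElementTree on a well-formed snippet, the helper name visible_text is made up, and like the original it only looks at inline style attributes.

```python
import re
import xml.etree.ElementTree as ET

# Inline styles treated as hidden; stylesheet rules are not considered.
HIDDEN = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden")

def visible_text(node):
    """Collect text top-down, skipping any subtree whose root is hidden inline."""
    if HIDDEN.search(node.get("style", "")):
        return []  # prune: nothing below a hidden node is visited
    parts = []
    if node.text and node.text.strip():
        parts.append(node.text.strip())
    for child in node:
        parts.extend(visible_text(child))
        # Text after a child belongs to this node, so keep it even if the child is hidden.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts

doc = ET.fromstring(
    '<div>'
    '<span style="display: none;">I am not visible</span>'
    '<span style="display: inline">I am visible</span>'
    '<div style="display:none"><span>I am inside a hidden div</span></div>'
    'trailing text'
    '</div>'
)
print(visible_text(doc))  # ['I am visible', 'trailing text']
```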

You'll need to write a custom filter function. A working example:
from bs4 import BeautifulSoup
import re
data = '''<span>
<style>.vAnH{display:none}.vsP6{display:inline}</style>
<span class="vAnH">34</span>
<span />
<span style="display: inline">111</span>
<span style="display:none">120</span>
<span class="vAnH">120</span>
<div style="display:none">120</div>
<span class="78">.</span>
<span class="vAnH">100</span>
<div style="display:none">100</div>
161
<span style="display: inline">.</span>
<span class="174">126</span>
<span class="vAnH">159</span>
<div style="display:none">159</div>
<span />
<span class="vsP6">.</span>
<span style="display:none">5</span>
<span class="vAnH">5</span>
<div style="display:none">5</div>
<span style="display:none">73</span>
<span class="vAnH">73</span>
<div style="display:none">73</div>
<span class="221">98</span>
<span style="display:none">194</span>
<div style="display:none">194</div>
</span>'''
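soup = BeautifulSoup(data, 'html.parser')
# class name that the embedded stylesheet hides ("vAnH" in this document)
no_disp = re.search(r'\.(.+?)\{display:none\}', soup.style.string).group(1)
def find_visible(tag):
    # keep tags that are not <style>, not in the hidden class, and not hidden inline
    return (tag.name != 'style'
            and no_disp not in tag.get('class', [])
            and 'display:none' not in tag.get('style', ''))
for tag in soup.find_all(find_visible, string=True):
    print(tag.string)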
