xpath for a content text from node - python

I'm doing web scraping for the first time, using Scrapy to get some prices from a website. The problem is that I don't know how to get the price because it is stored in an attribute of the node rather than in its text. This is also my first time with XPath, so I'm a little confused. Here is an example:
<span class="list d-block">
<span class="value" content="1250">
<span class="sr-only">
Precio reducido de
</span>
<span class="price-original">
<span class="">
$1.250
</span>
(Normal)
</span>
<span class="sr-only">
(Oferta)
</span>
</span>
</span>
I need the value of the content attribute, "1250" in this case, from the span with @class="value".
Any help will be great!

As I understand it, you want the value of the content attribute. Here is the XPath:
'//span[@class="value"]/@content'

On the XML that you posted, this XPath should work:
string(//span[@class='value']/@content)
See an XPath tutorial for details.
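As a rough illustration of the attribute lookup described above, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than Scrapy (an assumption made only so it runs without extra installs), and the snippet is a tidied copy of the HTML from the question. ElementTree supports only a subset of XPath, so the element is selected first and the attribute is read from it afterwards.

```python
import xml.etree.ElementTree as ET

# Tidied, well-formed copy of the markup from the question.
snippet = """
<span class="list d-block">
  <span class="value" content="1250">
    <span class="sr-only">Precio reducido de</span>
    <span class="price-original"><span class="">$1.250</span> (Normal)</span>
    <span class="sr-only">(Oferta)</span>
  </span>
</span>
"""

root = ET.fromstring(snippet)

# Select the span by its class attribute, then read its content attribute.
node = root.find(".//span[@class='value']")
price = node.get("content") if node is not None else None
print(price)
```

Inside a Scrapy callback, the one-step equivalent would be response.xpath('//span[@class="value"]/@content').get().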

Related

XPath local-name() SyntaxError: The expression is not a legal expression

I'm trying to web scrape a table from an iframe. In order to switch the driver to that frame I'm using driver.find_element_by_xpath, but the problem is that the path in the html code includes some namespaces that I cannot get Python to figure out using the local-name() function.
Here is the chunk of the HTML I'm using:
<xbrl:campo-captura xbrl:solo-lectura="true" xbrl:id-hecho-plantilla="ar_pros_CorporateStructure_11933a35-3932-44c0-b394-f0ebd4f722d2"
id="8a97271e-df5c-4fbe-bedf-513ea1508bf2"><div>
<div>
<i style="cursor:pointer; float:right;margin-right:-20px;" id="d9fa20ae-c55f-4344-baf5-0112a13827b6" class="i i-arrow-down-2 botonDetalleOperacionXbrl">
</i>
<div id="abrir_nota_F2a26d5a7-2934-4ff0-86df-7a8983c05e47" style="cursor:pointer;float:right;margin-right:-20px;margin-top:20px;" data-toggle="tooltip" data-placement="right" title="Abrir nota">
<i class="fa fa-external-link"></i>
</div>
</div>
<div class="campoTextBlock">
<div id="F2a26d5a7-2934-4ff0-86df-7a8983c05e47">
<div class="celdaAnchoFijo textBlockLimit div-default divTextBlockMaximo" id="divAreaTextod9fa20ae-c55f-4344-baf5-0112a13827b6" style="overflow-y:hidden">
<iframe scrolling="no" id="frame_8a97271e-df5c-4fbe-bedf-513ea1508bf2" style="width:100%;height:100%" frameborder="0"></iframe>
</div>
</div>
</div>
<div>
</div>
</div></xbrl:campo-captura>
I want to get to the "iframe" using something like:
framLogin= driver.find_element_by_xpath('//[local-name()="campo-captura"][@*[local-name()="id-hecho-plantilla" and .="ar_pros_CorporateStructure_11933a35-3932-44c0-b394-f0ebd4f722d2"]]/div[2]/div/div/iframe')
The message I get is
Given xpath expression ... is invalid: SyntaxError: Document.evaluate: The expression is not a legal expression.
I've already looked for more information but all I have found is not for Python.
I'm aware I could get to the iframe by using its id, but later on I want to write a loop to scrape the same tables from other URLs with the exact same format, and the iframe's id is not constant.
Your immediate syntax error can be fixed by changing
//[local-name()="campo-captura"]
to
//*[local-name()="campo-captura"]
  ^

Using BeautifulSoup to find text in HTML

This is my first time using BeautifulSoup as a scraping tool, and I am working through each step slowly.
I've used soup.find_all("div", class_="product-box__inner") to find the list of elements I want, but I can't figure out this particular part. My question is below.
Here is the HTML; my target is "$0". I have tried
element.find("span", title= re.compile("$")), and I can't use element.select("dt > dd > span > span") because there are multiple elements with the same tag structure that I don't need at all. Is there a way to target the span with data-fees-annual-value="" so that .text works?
<div class="product-box__features-item">
<dt class="f-body-3 product-box__features-label">Annual fee</dt>
<dd class="f-title-5 product-box__features-text u-margin-0">
<span>
<span data-fees-annual-value="">$0</span>
</span>
</dd>
</div>
You are close to your goal with CSS selectors; they can be made more specific by referencing the data-fees-annual-value attribute directly:
soup.select_one('span[data-fees-annual-value]').text
Example
from bs4 import BeautifulSoup
html="""
<div class="product-box__features-item">
<dt class="f-body-3 product-box__features-label">Annual fee</dt>
<dd class="f-title-5 product-box__features-text u-margin-0">
<span>
<span data-fees-annual-value="">$0</span>
</span>
</dd>
</div>
"""
soup=BeautifulSoup(html,"html.parser")
soup.select_one('span[data-fees-annual-value]').text
Output
$0
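If you want to find the element by its text, use string instead of title, and escape the $ so that it is treated as a literal dollar sign rather than an end-of-string anchor:
element.find("span", string=re.compile(r'\$'))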
Output:
<span data-fees-annual-value="">$0</span>
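To show why the escaping matters when searching by text, here is a small sketch using only the re module. The sample strings are taken from the HTML snippets on this page; the variable names are made up for illustration. An unescaped $ is an end-of-string anchor, so it matches every string, while an escaped one matches only text containing a dollar sign.

```python
import re

# Sample texts taken from the HTML shown on this page.
texts = ["Annual fee", "$0", "pic"]

unescaped = re.compile("$")    # end-of-string anchor: matches anything
escaped = re.compile(r"\$")    # literal dollar sign

matches_unescaped = [t for t in texts if unescaped.search(t)]
matches_escaped = [t for t in texts if escaped.search(t)]

print(matches_unescaped)
print(matches_escaped)
```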

how do I select xpath image without a class name using selenium in python?

How do I select an image by XPath when it has no class name? The HTML looks like this:
<img alt="" class src="https://images.craigslist.org/00J0J_i9BI6mN6rKP_300x300.jpg">
If I right-click and copy the XPath, it gives me //*[@id="sortable-results"]/ul/li[1]/a/img, but when I use it in my code I get an error.
In my code I use it like this:
src = driver.find_elements_by_xpath('/li[@class="result-row"]/a[@class="result-image gallery"]/img[@class=""]/@src')
but it returns [] when I print(src).
Full element:
<li class="result-row" data-pid="7017735595">
<a href="https://vancouver.craigslist.org/van/ele/d/vancouver-sealed-brand-new-in-box/7017735595.html" class="result-image gallery" data-ids="1:00J0J_i9BI6mN6rKP"><img alt="" class="" src="https://images.craigslist.org/00J0J_i9BI6mN6rKP_300x300.jpg">
<span class="result-price">$35</span>
</a>
<p class="result-info">
<span class="icon icon-star" role="button" title="save this post in your favorites list">
<span class="screen-reader-text">favorite this post</span>
</span>
<time class="result-date" datetime="2019-11-11 00:52" title="Mon 11 Nov 12:52:25 AM">Nov 11</time>
Sealed - Brand New in Box - Google Home Mini
<span class="result-meta">
<span class="result-price">$35</span>
<span class="result-hood"> (Vancouver)</span>
<span class="result-tags">
<span class="pictag">pic</span>
</span>
<span class="banish icon icon-trash" role="button">
<span class="screen-reader-text">hide this posting</span>
</span>
<span class="unbanish icon icon-trash red" role="button" aria-hidden="true"></span>
<a href="#" class="restore-link">
<span class="restore-narrow-text">restore</span>
<span class="restore-wide-text">restore this posting</span>
</a>
</span>
</p>
</li>
The XPath is close. You need to start the path with // and remove the trailing /@src, since find_elements returns elements rather than attribute values:
//li[@class="result-row"]/a[@class="result-image gallery"]/img[@class=""]
If you want to make sure the element has a src attribute, add a predicate:
//li[@class="result-row"]/a[@class="result-image gallery"]/img[@class=""][@src]
To get the src attribute, use get_attribute('src'):
src = driver.find_elements_by_xpath('//li[@class="result-row"]/a[@class="result-image gallery"]/img[@class=""]')[0].get_attribute('src')
Note that find_elements returns a list; use an index to get the first element.
If you want to use class="result-info" to locate the element, you can do:
elements = driver.find_elements_by_xpath('//p[@class="result-info"]/../a[@class="result-image gallery"]/img[@class=""]')
for element in elements:
    src = element.get_attribute('src')
Actually, the XPath was copied correctly; it was just used the wrong way in the fetch code.
If you want the specific image, use:
image = driver.find_element_by_xpath('//*[@id="sortable-results"]/ul/li[1]/a/img')
Or, if you want a list of all images matching the same path, use:
images = driver.find_elements_by_xpath('//*[@id="sortable-results"]/ul/li/a/img')
(i.e. remove the index from li, or from any other step you want to generalise, and use find_elements; use find_element to fetch a single specific element)
To get the src attribute, use the get_attribute method.
For case 1:
website = image.get_attribute('src')
For case 2:
website = images[0].get_attribute('src')

How to scrape from a span subclass using scrapy

<span class="price-box"> <span class="price"><span data-currency-iso="PKR">Rs.</span> <span dir="ltr" data-price="16999"> 16,999</span> </span> <span class="price -old "><span data-currency-iso="PKR">Rs.</span> <span dir="ltr" data-price="50000"> 50,000</span> </span> </span>
Hello. I need some help extracting the data-price attribute from the span with dir="ltr". I cannot work out how to extract it using Scrapy.
It is pretty simple (assuming you get this HTML as the response in a spider callback):
>>> response.css('span[dir=ltr]::attr(data-price)').extract()
['16999', '50000']
I would recommend reading about Scrapy Selectors.
As an alternative to @Stasdeep's answer, you could use XPath:
response.xpath('//span[@dir="ltr"]/@data-price').extract()
// -> any descendant span, no matter how deep it is
span[@dir="ltr"] -> a span whose dir attribute equals "ltr"
@data-price -> the attribute you want on that same span
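The same selection can be tried outside a spider. The sketch below is only an approximation: it swaps in the standard library's xml.etree.ElementTree for Scrapy's selectors (so it runs without Scrapy installed), uses the HTML from the question as input, and additionally converts the extracted strings to integers, which the answers above do not do.

```python
import xml.etree.ElementTree as ET

# The markup from the question, split across lines for readability.
snippet = (
    '<span class="price-box">'
    '<span class="price"><span data-currency-iso="PKR">Rs.</span> '
    '<span dir="ltr" data-price="16999"> 16,999</span></span>'
    '<span class="price -old "><span data-currency-iso="PKR">Rs.</span> '
    '<span dir="ltr" data-price="50000"> 50,000</span></span>'
    '</span>'
)

root = ET.fromstring(snippet)

# Every descendant span with dir="ltr", in document order,
# then its data-price attribute converted to an integer.
prices = [int(span.get("data-price")) for span in root.findall(".//span[@dir='ltr']")]
print(prices)
```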

how to make this regex only find the first match

I am aware that using regex to parse HTML is technically incorrect, but I found this out too far into the project (it's for coursework in which I have already stated that I will use regex, so it's too late to go back on that now).
I'm trying to write a Python program that takes an HTML document, strips out the numbers that follow the card-count class, and appends them to a list. The problem is that rather than consuming only the first match each time it runs, it seems to consume the first match and every other match identical to it, and so on. Here is some example HTML and my regex:
<span class="card-count">1</span>
<span class="card-name">Garruk Relentless</span>
</span>
<span class="row">
<span class="card-count">2</span>
<span class="card-name">Jace, the Mind Sculptor</span>
</span>
</div>
<div class="sorted-by-creature clearfix element">
<h5>Creature (16)</h5>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Deathrite Shaman</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Noble Hierarch</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Stoneforge Mystic</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">True-Name Nemesis</span>
</span>
</div>
<div class="sorted-by-sorcery clearfix element">
<h5>Sorcery (3)</h5>
<span class="row">
<span class="card-count">3</span>
<span class="card-name">Ponder</span>
</span>
And the python code is:
card_number_list=[]
number_of_cards=int(0)
#find out how many of x cards there are in the deck
def card_number_regex(card_number_list):
    global number_of_cards
    global html
    number_in_set= re.search("card-count.*",html)
    get_rid= re.search("card-count.*",html).group(0)
    html=html.replace(get_rid,"")
    number_in_set=number_in_set.group(0)
    html=html.replace(number_in_set, "")
    number_in_set=number_in_set.replace('card-count">',"")
    number_in_set=number_in_set.replace('</span>', "")
    card_number_list.append(number_in_set)
    number_in_set_int=int(number_in_set)
    print(number_in_set_int)
    number_of_cards=(number_of_cards+number_in_set_int)
    return number_of_cards
while number_of_cards<75:
    card_number_regex(card_number_list)
The output I get when I run this is
1
2
4
3
While many seem to rather bash on your choice to use regex for this task, I would argue that it does not seem too difficult for your specific goal and will provide an actual answer for what you asked for.
import re
a = html
b = re.findall('<span class="card-count">(.*?)</span>',a)
print(b[0])
That regex should give the contents of your card-count classes in a list, and using first index you retrieve only the match you want your regex to find.
Obviously this would work less well for other use-cases, but as you seem to know that you only ever want the first occurrence in the html-document it does not matter that list contains all of them, even when they are in another div tag etc.
And as others have said, I don't see why you wouldn't use a regular html parser for this.
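For what it's worth, the skipped numbers in the question appear to come from html.replace(get_rid, ""), which removes every identical occurrence of the matched text at once, so all four card-count">4</span> entries disappear after the first 4 is read. A minimal sketch that avoids mutating the HTML is below: re.search gives only the first match, and re.findall gives every match in order. The html string is a shortened copy of the example from the question, and the names CARD_COUNT, first_count, counts and total are made up for illustration.

```python
import re

# Shortened copy of the example HTML from the question.
html = """
<span class="row">
<span class="card-count">1</span>
<span class="card-name">Garruk Relentless</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Deathrite Shaman</span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name">Noble Hierarch</span>
</span>
"""

CARD_COUNT = re.compile(r'<span class="card-count">(\d+)</span>')

# re.search stops at the first match only.
first = CARD_COUNT.search(html)
first_count = int(first.group(1)) if first else None

# re.findall returns every match in document order, duplicates included.
counts = [int(n) for n in CARD_COUNT.findall(html)]
total = sum(counts)

print(first_count, counts, total)
```

If the original approach of consuming the HTML piece by piece is kept, passing a count of 1 as the third argument to str.replace limits the removal to the first occurrence.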
