How to get non-element text adjacent to a tag using Scrapy? - python

I am trying to scrape a page using Scrapy Framework.
<div class="info"><span class="label">Establishment year</span> 2014</div>
The tag I want to deal with looks like the above. I want to get the value 2014. I can't use the info or label class, as they are common throughout the page.
So I tried the XPath expressions below, but I am getting null:
response.xpath("//span[contains(text(),'Establishment year')]/following-sibling").get()
response.xpath("//span[contains(text(),'Establishment year')]/following-sibling::text()").get()
Any clue what can be the issue?

Since you are trying to extract text that sits in between tags, you should reference the target tag at the end of the selector. I don't know what website you are trying to scrape, but here is an example of me scraping the text inside an 'a' tag on this website: http://books.toscrape.com/. Here is the code I used for it:
response.xpath("(//h3)[1]/a/text()").extract_first()
In your second line of code you did not use the text-extraction syntax correctly. The one you are using is for a CSS selector; for XPath it would be /text(), not ::text(). For your code I think you should try one of these options. Let me know if it helps:
response.xpath("//span[contains(text(),'Establishment year')]/div/text()").get()
or
response.xpath("//span[contains(text(),'Establishment year')]/span/text()").get()

Extract direct text children (/text()) from the parent element:
>>> from parsel import Selector
>>> selector = Selector(text='<div class="info"><span class="label">Establishment year</span> 2014</div>')
>>> selector.xpath('//*[@class="info"]/text()').get()
' 2014'
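The same idea can be written against a Scrapy response. The lines below are a sketch only, assuming the <div class="info"> markup shown in the question: anchor on the span's label text, then read the parent div's own text nodes, which avoids relying on the common info/label classes alone.
# Sketch only: assumes the markup from the question. Anchor on the label text,
# then take the parent div's direct text children.
year = response.xpath(
    '//span[contains(text(), "Establishment year")]/parent::div/text()'
).get()
if year is not None:
    year = year.strip()  # ' 2014' -> '2014'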

Related

Scrapy css selector get text from occurrence of first class

I'm trying to scrape text from the class .s-recipe-header__info-item, but as you can see in the picture, there are three elements with the same class name, and I would like to extract only the first one to get the text "Do hodiny". See the image of the code here. So far I have tried this code:
recipe_item["preparation_time"] = response.css(".s-recipe-header__info > .s-recipe-header__info-items > .s-recipe-header__info-item::text").extract_first()
I have also tried to use .get() instead of .extract_first(), but neither seems to work...
I am new to web scraping and I have only elementary HTML and CSS knowledge. Thank you in advance for your help.
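For what it's worth, both .get() and .extract_first() already return only the first match, so the selector itself may be the issue. Below is a minimal sketch, assuming the class names in the question match the actual markup, that picks the first occurrence explicitly with an XPath positional predicate:
# Sketch only: assumes the class names from the question match the page markup.
# The (...)[1] positional predicate limits the match to the first occurrence.
recipe_item["preparation_time"] = response.xpath(
    '(//*[contains(@class, "s-recipe-header__info-item")])[1]/text()'
).get()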

I am confused why this XPath selector does not work

I am learning to use scrapy and playing with XPath selectors, and decided to practice by scraping job titles from craigslist.
Here is the html of a single job link from the craigslist page I am trying to scrape the job titles from:
<a href="..." class="result-title hdrlnk">Full Stack .NET C# Developer (Mid-Level, Senior) ***LOCAL ONLY***</a>
What I wanted to do was retrieve all of the similar a tags with the class result-title, so I used the XPath selector:
titles = response.xpath('//a[@class="result-title"/text()]').getall()
but the output I receive is an empty list: []
I was able to copy the XPath directly from Chrome's inspector, which ended up working perfectly and gave me a full list of job title names. This selector was:
titles = response.xpath('*//div[@id="sortable-results"]/ul/li/p/a/text()').getall()
I can see why this second XPath selector works, but I don't understand why my first attempt did not. Can someone explain why my first XPath selector failed? I have also provided a link to the full HTML for the craigslist page below, if that is helpful/necessary. I am new to scrapy and want to learn from my mistakes. Thank you!
view-source:https://orangecounty.craigslist.org/search/sof
Like this:
'//a[contains(@class,"result-title ")]/text()'
Or:
'//a[starts-with(@class,"result-title ")]/text()'
I use contains() or starts-with() because the class of the a node is
result-title hdrlnk
not just
result-title
In your XPath:
'//a[@class="result-title"/text()]'
even if the class were just result-title, the syntax is wrong; you should use:
'//a[#class="result-title"]/text()'
Simply '//a[@class="result-title hdrlnk"]/text()'
Two fixes were needed:
/text() moved outside of the [] predicate
"result-title hdrlnk" instead of just "result-title" in the attribute test, because XPath compares the full attribute value literally (it is XML matching, not CSS class matching), so the exact attribute content is needed.

Having trouble finding certain <div> tags using CSS Selectors

I am trying to scrape information from a website using a CSS selector in order to get a specific text element, but I have come across a problem. When I search for the desired portion of the page, my program tells me that it does not exist and returns an empty list.
I am using the requests and lxml libraries, with CSS selectors for the HTML scraping, on Python 3.7. I have also tried using XPath, but that has failed as well. I have tried using the following selector:
div#showtimes
When I use this selector, I get the following result:
[<Element div at 0x3bf6f60>]
I get the expected result, which is the desired element. When I try to go one step further and access the element nested inside of the div#showtimes element (see below), I get an empty list.
div#showtimes div
I get the following result:
[]
Through inspection of the website's HTML, I know that there is a nested element within the div#showtimes element. This problem has occurred on other web pages as well. I am using the code below.
import requests
from lxml import html
from lxml.cssselect import CSSSelector
# Set URL
url = "http://www.fridleytheatres.com/location/7425/Paramount-7-Theatres-
Showtimes"
# Get HTML from page
page = requests.get(url)
data = html.fromstring(page.text)
# Set up CSSSelector
sel = CSSSelector('div#showtimes div')
# Apply Selector
results = sel(data)
print(results)
I expect the output to be a list containing an element, but it is returning an empty list [].
If I understand the problem correctly, you're attempting to get a div element which is a child of div#showtimes. Try using div#showtimes > div.
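Applied to the lxml setup from the question, that suggestion would look like the short sketch below; the child combinator (>) is the only change:
# Sketch only: same lxml setup as the question, with the child combinator.
sel = CSSSelector('div#showtimes > div')
results = sel(data)
print(results)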

What is the most efficient way to get a specific link using Beautiful Soup in Python 3.0?

I am currently learning the Python specialization on Coursera. I have come across the issue of extracting a specific link from a webpage using BeautifulSoup. From this webpage (http://py4e-data.dr-chuck.net/known_by_Fikret.html), I am supposed to extract a URL based on user input, open the subsequent links (all identified through the anchor tag), and run some number of iterations.
While I am able to program this using lists, I am wondering if there is any simpler way of doing it without using lists or dictionaries?
import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
nameList=list()
loc=''
count=0
for tag in tags:
    loc = tag.get('href', None)
    nameList.append(loc)
url = nameList[pos-1]
In the above code, you will notice that after locating the links using the 'a' tag and 'href', I can't help but create a list called nameList to locate the position of the link. As this is inefficient, I would like to know if I could directly locate the URL without using lists. Thanks in advance!
The easiest way is to get an element out of the tags list and then extract the href value:
tags = soup('a')
a = tags[pos-1]
loc = a.get('href', None)
You can also use the soup.select_one() method to query an :nth-of-type element:
soup.select_one('a:nth-of-type({})'.format(pos))
As :nth-of-type uses 1-based indexing, you don't need to subtract 1 from the pos value if your users are expected to use 1-based indexing too.
Note that soup's :nth-of-type is not equivalent to the CSS :nth-of-type pseudo-class, as it always selects only one element, while the CSS selector may select many elements at once.
And if you're looking for "the most efficient way", then you need to look at lxml:
from lxml.html import fromstring
tree = fromstring(r.content)
url = tree.xpath('(//a)[{}]/@href'.format(pos))[0]
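For completeness, here is a small end-to-end sketch of the indexing approach, with pos and the starting URL filled in as hypothetical example values:
# Sketch only: pos is hypothetical 1-based user input; the URL is the starting
# page from the question.
import urllib.request
from bs4 import BeautifulSoup

url = "http://py4e-data.dr-chuck.net/known_by_Fikret.html"
pos = 3

html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
tag = soup("a")[pos - 1]        # index directly instead of building a list
url = tag.get("href", None)     # the next URL to open
print(url)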

Parsing a wiki-styled web page, XPath error

I am new to XPath, and I totally fail to parse a simple wiki-styled web page with lxml.
I have a following expression:
"".join(tree.xpath('//*[#id="mw-content-text"]/div[1]/p//text()'))
It works fine, but I need to exclude children whose class is "reference", and I get an lxml.etree.XPathEvalError with the following expression:
"".join(tree.xpath('//*[#id="mw-content-text"]/div[1]/p//*[not(#class="reference")].text()'))
What is the right XPath expression? Thanks in advance :)
Probably, the error occurred because of .text() instead of /text().
If you also want to include the text of the p elements themselves, you have to use the descendant-or-self XPath axis:
//*[#id="mw-content-text"]/div[1]/p/descendant-or-self::*[not(#class="reference")]/text()
