Parsing a wiki-styled web page, XPath error - python

I am new to XPath, and I totally fail to parse a simple wiki-styled web page with lxml.
I have the following expression:
"".join(tree.xpath('//*[@id="mw-content-text"]/div[1]/p//text()'))
It works fine, but I need to exclude children whose class is "reference", and I get an lxml.etree.XPathEvalError with the following expression:
"".join(tree.xpath('//*[@id="mw-content-text"]/div[1]/p//*[not(@class="reference")].text()'))
What is the right XPath expression? Thanks in advance :)

The error probably occurred because of .text() instead of /text().
If you also want to include the text of the p elements themselves, you have to use the descendant-or-self XPath axis:
//*[@id="mw-content-text"]/div[1]/p/descendant-or-self::*[not(@class="reference")]/text()
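To see which text nodes an expression of the form descendant-or-self::*[not(@class="reference")]/text() keeps, here is a minimal sketch that mimics its selection logic with Python's standard-library xml.etree.ElementTree rather than lxml (a swapped-in library, because ElementTree has no text() step). The sample markup and the helper name kept_text are hypothetical and only serve as illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one wiki paragraph.
SAMPLE = (
    '<div id="mw-content-text"><div><p>Python'
    '<sup class="reference">[1]</sup> is a <b>language</b>.</p></div></div>'
)


def kept_text(elem):
    """Collect text nodes whose parent element is not class="reference".

    In ElementTree, the text-node children of an element are its own
    .text plus the .tail of each of its child elements.
    """
    keep = elem.get("class") != "reference"
    parts = []
    if keep and elem.text:
        parts.append(elem.text)
    for child in elem:
        parts.extend(kept_text(child))
        if keep and child.tail:
            parts.append(child.tail)
    return parts


root = ET.fromstring(SAMPLE)
paragraph = root.find("./div/p")
result = "".join(kept_text(paragraph))
print(result)
```

Note that the predicate only tests each element's own class attribute, so text inside an element nested within a reference (for example a link inside the sup) would still be kept by this logic.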

Related

I found a span on a website that is not visible and I can't scrape it! Why?

I'm currently trying to scrape data from a website using Selenium.
Everything was working as it should, until I realised I have to scrape a tooltip text.
I have already found several threads on Stack Overflow that provide an answer, but I have not managed to solve this issue so far.
After a few hours of frustration I realised the following:
This span has nothing to do with the tooltip I guess. Because the tooltip looks like this:
There is actually a span that I can't read. I try to read it like this:
bewertung = driver.find_elements_by_xpath('//span[@class="a-icon-alt"]')
for item in bewertung:
    print(item.text)
So Selenium finds this element, but unfortunately '.text' returns nothing. Why is it always empty?
And what is the span from the first screenshot for? By the way, it is not displayed on the website either.
Since you mention that Selenium finds this element, I assume you have printed the length of the bewertung list,
something like
print(len(bewertung))
If this list has elements in it, you could probably use innerText:
bewertung = driver.find_elements_by_xpath('//span[@class="a-icon-alt"]')
for item in bewertung:
    print(item.get_attribute("innerText"))
Note that you are using find_elements, which does not throw an error when nothing matches; it returns an empty list instead.
If you use find_element instead, it raises an exception when the element is not found.
Also, I think your XPath points at a span that does not appear in the UI (sometimes such elements only appear after some action is triggered).
You can try to use this XPath instead:
//i[@data-hook='average-stars-rating-anywhere']//span[@data-hook='acr-average-stars-rating-text']
Something like this in code:
bewertung = driver.find_elements_by_xpath("//i[@data-hook='average-stars-rating-anywhere']//span[@data-hook='acr-average-stars-rating-text']")
for item in bewertung:
    print(item.text)

I am confused why this XPath selector does not work

I am learning to use scrapy and playing with XPath selectors, and decided to practice by scraping job titles from craigslist.
Here is the html of a single job link from the craigslist page I am trying to scrape the job titles from:
Full Stack .NET C# Developer (Mid-Level, Senior) ***LOCAL ONLY***
What I wanted to do was retrieve all of the similar a tags with the class result-title, so I used the XPath selector:
titles = response.xpath('//a[@class="result-title"/text()]').getall()
but the output I receive is an empty list: []
I was able to copy the XPath directly from Chrome's inspector, which ended up working perfectly and gave me a full list of job title names. This selector was:
titles = response.xpath('*//div[@id="sortable-results"]/ul/li/p/a/text()').getall()
I can see why this second XPath selector works, but I don't understand why my first attempt did not work. Can someone explain to me why my first XPath selector failed? I have also provided a link to the full HTML for the craigslist page below, in case that is helpful or necessary. I am new to scrapy and want to learn from my mistakes. Thank you!
view-source:https://orangecounty.craigslist.org/search/sof
Like this:
'//a[contains(@class,"result-title ")]/text()'
Or:
'//a[starts-with(@class,"result-title ")]/text()'
I use contains() or starts-with() because the class of the a node is
result-title hdrlnk
not just
result-title
In your XPath:
'//a[@class="result-title"/text()]'
even if the class was result-title, the syntax is wrong, you should use:
'//a[@class="result-title"]/text()'
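The difference between comparing the whole class attribute and matching a single class name can be sketched with Python's standard-library xml.etree.ElementTree, used here only as a stand-in for Scrapy's selectors because it supports the same [@attr='value'] predicate. The sample markup is hypothetical, since the original HTML of the link was not preserved.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: only the class value "result-title hdrlnk" is from the thread.
SAMPLE = (
    '<ul><li><p><a class="result-title hdrlnk" href="#">'
    'Full Stack .NET C# Developer</a></p></li></ul>'
)
root = ET.fromstring(SAMPLE)

# Exact string comparison against the whole attribute value: no match.
exact_single = root.findall(".//a[@class='result-title']")

# Exact comparison with the full attribute value: one match.
exact_full = root.findall(".//a[@class='result-title hdrlnk']")

# Token-based match, similar in spirit to the CSS selector a.result-title.
by_token = [
    a for a in root.iter("a")
    if "result-title" in a.get("class", "").split()
]

print(len(exact_single), len(exact_full), [a.text for a in by_token])
```

In a full XPath 1.0 engine, a commonly used token-safe form is //a[contains(concat(" ", normalize-space(@class), " "), " result-title ")]/text(), which does not depend on the order of the class names.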
Simply '//a[@class="result-title hdrlnk"]/text()'
Two fixes were needed:
/text() goes outside of []
"result-title hdrlnk", not just "result-title", in the attribute test: an XPath @class="..." comparison matches the entire attribute value as a string, unlike a CSS class selector, so the exact attribute content is needed to match.

How to get non element text adjacent to a tag using Scrapy?

I am trying to scrape a page using Scrapy Framework.
<div class="info"><span class="label">Establishment year</span> 2014</div>
The tag I want to deal with looks like the one above. I want to get the value 2014. I can't use the info or label class, as they are common throughout the page.
So I tried the XPath expressions below, but I am getting null:
response.xpath("//span[contains(text(),'Establishment year')]/following-sibling").get()
response.xpath("//span[contains(text(),'Establishment year')]/following-sibling::text()").get()
Any clue what can be the issue?
Since you are trying to extract text from inside a tag, you should end the expression with that tag. I don't know which website you are scraping, but here is an example that extracts the text inside the 'a' tag on http://books.toscrape.com/. Here is the code I used for it:
response.xpath("(//h3)[1]/a/text()").extract_first()
In your second line of code you did not use the text-extraction function correctly. The one you are using is for CSS selectors. For XPath it would be /text(), not ::text(). For your code, I think you should try one of these options. Let me know if it helps.
response.xpath("//span[contains(text(),'Establishment year')]/div/text()").get()
or
response.xpath("//span[contains(text(),'Establishment year')]/span/text()").get()
Extract direct text children (/text()) from the parent element:
>>> from parsel import Selector
>>> selector = Selector(text='<div class="info"><span class="label">Establishment year</span> 2014</div>')
>>> selector.xpath('//*[@class="info"]/text()').get()
' 2014'
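For comparison, the same value can be reached starting from the label itself, without relying on the shared class names. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree (not Scrapy or parsel), where text that follows an element's closing tag is exposed as that element's .tail attribute. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<div class="info"><span class="label">Establishment year</span> 2014</div>'
)

# Locate the label by its text, since the class names are shared across the page.
label = next(
    span for span in root.iter("span")
    if "Establishment year" in (span.text or "")
)

# Text that follows an element's closing tag is stored in its .tail attribute.
year = (label.tail or "").strip()
print(year)
```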

scrapy get parent element

I'm new to scrapy and need to do one thing. I used to use lxml and did
elements = careers.xpath('//text()[contains(., "engineer")')
After that I was able to do
element = elements[0].getparent()
Unfortunately, I can't do the same with scrapy.
I try doing
response.xpath('//text()[contains(., "engineer")')
as well as .getparent() from any of these elements, but it says that Selectors have no attribute getparent. Is it possible to do the same with scrapy?
To access the parent element, you can use the .. notation at the end of your XPath expression. Consider reading this other StackOverflow answer for more details.
Apart from that, you may want to add a closing ] at the end of your XPath to close the [ before contains.
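The .. step can be tried out without Scrapy. Below is a minimal sketch using Python's standard-library xml.etree.ElementTree, whose limited XPath subset also supports .. for moving to the parent; the markup and class names are hypothetical. In Scrapy, the same step would be appended to the corrected expression, for example //text()[contains(., "engineer")]/.. (not run here), which yields the element that holds the matching text.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing markup for illustration.
careers = ET.fromstring(
    "<ul>"
    '<li><a class="job">Software engineer</a></li>'
    '<li><a class="other">Designer</a></li>'
    "</ul>"
)

# The trailing /.. moves from each matched <a> to its parent element.
parents = careers.findall(".//a[@class='job']/..")
print([p.tag for p in parents])
```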

selenium get element by css selector

I am trying to get user details from each block as given
driver.get("https://www.facebook.com/public/karim-pathan")
wait = WebDriverWait(driver, 10)
li_link = []
for s in driver.find_elements_by_class_name('clearfix'):
    print(s)
    print(s.find_element_by_css_selector('_8o._8r.lfloat._ohe').get_attribute('href'))
    print(s.find_element_by_tag_name('img').get_attribute('src'))
it says:
unable to find element with css selector
Any hint is appreciated.
This is just a guess, based on the assumption that you are not logged in. You are getting the exception because not every element with class clearfix contains an element matching ._8o._8r.lfloat._ohe, so your code never reaches the required elements. In any case, if you are trying to fetch the href and image source of the results, you need not iterate over all clearfix elements. As suggested by @leo.fcx, your CSS selector is incorrect; using the selector provided by leo, you can achieve the desired result as:
driver.get("https://www.facebook.com/public/karim-pathan")
for s in driver.find_elements_by_css_selector('._8o._8r.lfloat._ohe'):  # no need to iterate over every clearfix element
    print(s.get_attribute('href'))
    print(s.find_element_by_tag_name('img').get_attribute('src'))
P.S. sorry for any syntax, never explored python binding :)
Since you are listing class names that apply to the element, adding a . to the beginning of your CSS selector should fix it.
Try this:
s.find_element_by_css_selector('._8o._8r.lfloat._ohe')
instead of:
s.find_element_by_css_selector('_8o._8r.lfloat._ohe')
Adding to what @leo.fcx pointed out about the selector, wait for the search results to become visible:
wait.until(EC.visibility_of_element_located((By.ID, "all_search_results")))
