I am confused why this XPath selector does not work - python

I am learning to use scrapy and playing with XPath selectors, and decided to practice by scraping job titles from craigslist.
Here is the html of a single job link from the craigslist page I am trying to scrape the job titles from:
Full Stack .NET C# Developer (Mid-Level, Senior) ***LOCAL ONLY***
What I wanted to do was retrieve all of the similar a tags with the class result-title, so I used the XPath selector:
titles = response.xpath('//a[@class="result-title"/text()]').getall()
but the output I receive is an empty list: []
I was able to copy the XPath directly from Chrome's inspector, which ended up working perfectly and gave me a full list of job title names. This selector was:
titles = response.xpath('*//div[@id="sortable-results"]/ul/li/p/a/text()').getall()
I can see why this second XPath selector works, but I don't understand why my first attempt did not. Can someone explain why my first XPath selector failed? I have also provided a link to the full html for the craigslist page below in case that is helpful or necessary. I am new to scrapy and want to learn from my mistakes. Thank you!
view-source:https://orangecounty.craigslist.org/search/sof

Like this:
'//a[contains(@class,"result-title ")]/text()'
Or:
'//a[starts-with(@class,"result-title ")]/text()'
I use contains() or starts-with() because the class of the a node is
result-title hdrlnk
not just
result-title
In your XPath:
'//a[@class="result-title"/text()]'
even if the class were just result-title, the syntax is wrong, because /text() must come after the closing bracket. You should use:
'//a[@class="result-title"]/text()'
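
The exact-match behaviour described above can be checked offline. The sketch below is only an approximation: it uses Python's standard-library xml.etree.ElementTree (which supports a small subset of XPath) rather than Scrapy's selector, and the markup and href are made up to mirror the class value quoted above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the class value quoted above; the href is made up.
html = (
    '<ul><li><p>'
    '<a class="result-title hdrlnk" href="/job/1">Full Stack .NET C# Developer</a>'
    '</p></li></ul>'
)
root = ET.fromstring(html)

# Comparing against only part of the class value matches nothing.
exact_partial = root.findall(".//a[@class='result-title']")

# Comparing against the whole class value matches.
exact_full = root.findall(".//a[@class='result-title hdrlnk']")

# Token-based check, closer to what a CSS class selector does.
by_token = [a for a in root.iter("a") if "result-title" in a.get("class", "").split()]

print(len(exact_partial))
print([a.text for a in exact_full])
print([a.text for a in by_token])
```

In Scrapy itself, a CSS selector such as response.css('a.result-title::text').getall() performs the token-based match directly.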

Simply '//a[@class="result-title hdrlnk"]/text()'
It needed 2 fixes:
/text() goes outside of the []
"result-title hdrlnk", not only "result-title", in the attribute test. XPath compares the whole attribute value as a string, unlike a CSS class selector, which matches a single class name, so the exact attribute content is needed to match.

Related

Python Selenium: How do I print the correct tag?

I am trying to print by ID using Selenium. As far as I can tell, "a" is the tag and "title" is the attribute. See HTML below.
When I run the following code:
print(driver.find_elements(By.TAG_NAME, "a")[0].get_attribute('title'))
I get the output:
Zero Tolerance
So I'm getting the first attribute correctly. When I increment the code above:
print(driver.find_elements(By.TAG_NAME, "a")[1].get_attribute('title'))
My expected output is:
Aaliyah Love
However, I'm just getting blank. No errors. What am I doing incorrectly? Pls don't suggest using xpath or css, I'm trying to learn Selenium tags.
HTML:
<a class=" Link ScenePlayer-ChannelName-Link styles_1lHAYbZZr4 Link ScenePlayer-ChannelName-Link styles_1lHAYbZZr4" href="/en/channel/ztfilms" title="Zero Tolerance" rel="">Zero Tolerance</a>
...
<a class=" Link ActorThumb-ActorImage-Link styles_3dXcTxVCON Link ActorThumb-ActorImage-Link styles_3dXcTxVCON" href="/[RETRACTED]/Aaliyah-Love/63565" title="Aaliyah Love"
Selenium locators are a toolbox and you're saying you only want to use a screwdriver (By.TAG_NAME) for all jobs. We aren't saying that you shouldn't use By.TAG_NAME, we're saying that you should use the right tool for the right job and sometimes (most times) By.TAG_NAME is not the right tool for the job. CSS selectors are WAY more powerful locators because they can search for not only tags but also classes, properties, etc.
It's hard to say for sure what's going on without access to the site/page. It could be that the entire page isn't loaded and you need to add a wait for the page to finish loading (maybe count links expected on the page?). It could be that your locator isn't specific enough and is catching other A tags that don't have a title attribute.
I would start by doing some debugging.
links = driver.find_elements(By.TAG_NAME, "a")
for link in links:
    print(link.get_attribute('title'))
What does this print?
If it prints some blank lines sprinkled throughout the actual titles, your locator is probably not specific enough. Try a CSS selector
links = driver.find_elements(By.CSS_SELECTOR, "a[title]")
for link in links:
    print(link.get_attribute('title'))
If instead it returns some titles and then nothing but blank lines, the page is probably not fully loaded. Try something like
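from selenium.webdriver.support.ui import WebDriverWait

count = 20  # the number of expected links on the page
link_locator = (By.TAG_NAME, "a")
# find_elements takes the strategy and the value as separate arguments, so unpack the tuple
WebDriverWait(driver, 10).until(lambda wd: len(wd.find_elements(*link_locator)) == count)
links = driver.find_elements(*link_locator)
for link in links:
    print(link.get_attribute('title'))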
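
The "locator not specific enough" case can be reproduced without a browser. This is a rough sketch that uses Python's standard-library html.parser as a stand-in for Selenium; the middle link and the href values other than the first are hypothetical, and the empty string only mimics the blank output described in the question.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the title of every <a> tag, using '' when the attribute is missing."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.titles.append(dict(attrs).get("title") or "")


# The second and third links are invented for illustration.
page = (
    '<a href="/en/channel/ztfilms" title="Zero Tolerance">Zero Tolerance</a>'
    '<a href="/login">Log in</a>'
    '<a href="/profile" title="Aaliyah Love">Aaliyah Love</a>'
)

collector = TitleCollector()
collector.feed(page)

all_titles = collector.titles                # like By.TAG_NAME, "a"
with_title = [t for t in all_titles if t]    # like By.CSS_SELECTOR, "a[title]"

print(all_titles)
print(with_title)
```

Indexing all_titles at position 1 gives a blank, while the filtered list has the expected name there, which is the pattern the debugging loop above is meant to reveal.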

How to open a list of links and scrape the text with Selenium

I am new to programming in Python and I want to write code to scrape text from articles on Reuters using Selenium. I'm trying to open the article links and then get the full text from each article, but it doesn't work. I would be glad if somebody could help me.
article_links1 = []
for link in driver.find_elements_by_xpath("/html/body/div[4]/section[2]/div/div[1]/div[4]/div/div[3]/div[*]/div/h3/a"):
    links = link.get_attribute("href")
    article_links1.append(links)
article_links = article_links1[:5]
article_links
This is a shortened list of the articles, so it doesn't take that long to scrape for testing. It contains 5 links; this is the output:
['https://www.reuters.com/article/idUSKCN2DM21B',
'https://www.reuters.com/article/idUSL2N2NS20U',
'https://www.reuters.com/article/idUSKCN2DM20N',
'https://www.reuters.com/article/idUSKCN2DM21W',
'https://www.reuters.com/article/idUSL3N2NS2F7']
Then I tried to iterate over the links and scrape the text out of the paragraphs, but it doesn't work.
for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    time.sleep(5)
for article_text in driver.find_elements_by_xpath("/html/body/div[1]/div/div[4]/div[1]/article/div[1]/p[*]"):
    full_text.append(article_text.text)
full_text
The output is only the empty list:
[]
There are a couple of issues with your current code. The first one is an easy fix. You need to indent your second for loop so that it sits inside the for loop that iterates through each article. Otherwise, nothing is added to the full_text list until the outer loop has reached the last article. It should look like this:
for article in article_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(article)
    time.sleep(5)
    for article_text in driver.find_elements_by_xpath("/html/body/div[1]/div/div[4]/div[1]/article/div[1]/p[*]"):
        full_text.append(article_text.text)
The second problem lies within your xpath. Xpath can be very long when it's generated automatically by a browser. (I'd suggest learning CSS selectors, which are pretty concise. A good place to learn CSS selectors is called the CSS Diner.)
I've changed your find_elements_by_xpath() function to find_elements_by_css_selector(). You can see the example below.
for article_text in driver.find_elements_by_css_selector("article p"):
    full_text.append(article_text.text)
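
The effect of the indentation fix can be seen without a browser. The sketch below is purely illustrative: FakeDriver, PAGES, and the example.com URLs are invented stand-ins for the Selenium driver and the Reuters pages.

```python
# Hypothetical stand-in for a browser: each URL maps to the paragraphs on that page.
PAGES = {
    "https://example.com/a": ["A1", "A2"],
    "https://example.com/b": ["B1"],
}


class FakeDriver:
    def __init__(self):
        self.current = []

    def get(self, url):
        # "Load" a page by remembering its paragraphs.
        self.current = PAGES[url]

    def paragraphs(self):
        return list(self.current)


def scrape_flat(urls):
    driver, full_text = FakeDriver(), []
    for url in urls:
        driver.get(url)
    # Not nested: runs once, after the last page has loaded.
    for text in driver.paragraphs():
        full_text.append(text)
    return full_text


def scrape_nested(urls):
    driver, full_text = FakeDriver(), []
    for url in urls:
        driver.get(url)
        # Nested: runs for every page.
        for text in driver.paragraphs():
            full_text.append(text)
    return full_text


flat = scrape_flat(list(PAGES))
nested = scrape_nested(list(PAGES))
print(flat)
print(nested)
```

The flat version only sees the paragraphs of the page that was loaded last, while the nested version collects from every page.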

get_attribute('href') returning None

I am learning selenium. I am trying to scrape the amazon website with selenium. Here is the link I am trying to scrape.
In the above url I am trying to extract all the elements with the class a-size-mini and extract the link from these elements.
Here is my code:
links = driver.find_elements_by_class_name("a-size-mini")
for link in links:
    element = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.LINK_TEXT, link.text)))
    print(element.get_attribute('href'))
But this is returning None. I am not sure what I am doing wrong. The length of the links list is showing as 55, and when I try to print the element variable I get the following:
<selenium.webdriver.remote.webelement.WebElement (session="121606058bd493d1a70fc957699d7f6d", element="c3dd6f5b-a9bb-409c-8ee2-666cac7e7432")>
So these variables are not empty or None. But when I try to extract the link using get_attribute('href') method it returns None
Please help me out. Thanks in advance
Please use this command.
links = driver.find_elements_by_xpath('//h2[contains(@class, "a-size-mini")]/a')
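The class a-size-mini appears to sit on the h2 heading rather than on the link, so this XPath selects the a element inside each matching h2, and get_attribute('href') can then be read from that element directly.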
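
As a rough offline illustration of why the element matters, the sketch below uses Python's standard-library xml.etree.ElementTree instead of Selenium. The markup is an assumption: it places the class on the h2 and the href on the child a, which is what the XPath above implies, and the href value is made up.

```python
import xml.etree.ElementTree as ET

# Assumed structure: the class is on the heading, the href is on the link inside it.
snippet = (
    '<div>'
    '<h2 class="a-size-mini"><a href="/dp/EXAMPLE"><span>Some product</span></a></h2>'
    '</div>'
)
root = ET.fromstring(snippet)

headings = root.findall(".//h2[@class='a-size-mini']")
links = root.findall(".//h2[@class='a-size-mini']/a")

heading_hrefs = [h.get("href") for h in headings]  # the heading has no href
link_hrefs = [a.get("href") for a in links]        # the link does

print(heading_hrefs)
print(link_hrefs)
```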

How to get non element text adjacent to a tag using Scrapy?

I am trying to scrape a page using Scrapy Framework.
<div class="info"><span class="label">Establishment year</span> 2014</div>
The tag I want to deal with looks like above. I want to get the value 2014. I can't use info or label class as they are common through the page.
So, I tried below xpath but I am getting null:
response.xpath("//span[contains(text(),'Establishment year')]/following-sibling").get()
response.xpath("//span[contains(text(),'Establishment year')]/following-sibling::text()").get()
Any clue what can be the issue?
Since you are trying to extract the text inside a tag, the path should end with that tag. I don't know which website you are scraping, but here is an example that extracts the text inside the first 'a' tag on http://books.toscrape.com/. Here is the code I used for it:
response.xpath("(//h3)[1]/a/text()").extract_first()
In your second line of code, the text-extraction step is not written correctly. The one you are using is for CSS selectors. For XPath it would be /text(), not ::text(). For your code, I think you should try one of these options. Let me know if it helps.
response.xpath("//span[contains(text(),'Establishment year')]/div/text()").get()
or
response.xpath("//span[contains(text(),'Establishment year')]/span/text()").get()
Extract direct text children (/text()) from the parent element:
>>> from parsel import Selector
>>> selector = Selector(text='<div class="info"><span class="label">Establishment year</span> 2014</div>')
>>> selector.xpath('//*[@class="info"]/text()').get()
' 2014'
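
For comparison, here is a minimal sketch of the same idea using only Python's standard library instead of parsel. In xml.etree.ElementTree, text that follows an element inside its parent is exposed as that element's tail, so the value can be read from the span itself.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring(
    '<div class="info"><span class="label">Establishment year</span> 2014</div>'
)

span = div.find("span")
raw = span.tail        # the text between </span> and </div>
year = raw.strip()     # drop the leading space

print(repr(raw))
print(year)
```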

Parsing a wiki-styled web page, XPath error

I am new to XPath, and I am failing to parse a simple wiki-styled web page with lxml.
I have the following expression:
"".join(tree.xpath('//*[@id="mw-content-text"]/div[1]/p//text()'))
It works fine, but I need to exclude children whose class is "reference", and I get an lxml.etree.XPathEvalError with the following expression:
"".join(tree.xpath('//*[@id="mw-content-text"]/div[1]/p//*[not(@class="reference")].text()'))
What is the right XPath expression? Thanks in advance :)
The error probably occurred because of .text() instead of /text().
If you also want to include the text of the p elements themselves, you have to use the descendant-or-self XPath axis:
//*[@id="mw-content-text"]/div[1]/p/descendant-or-self::*[not(@class="reference")]/text()
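
To make clear which text nodes that expression keeps, here is a hedged sketch that mirrors its logic in plain Python with the standard-library xml.etree.ElementTree instead of lxml. The helper text_nodes and the sample paragraph are invented for illustration.

```python
import xml.etree.ElementTree as ET


def text_nodes(el):
    """Yield, in document order, the text directly inside every element of el's
    subtree (el included) whose class is not 'reference'."""
    keep = el.get("class") != "reference"
    if keep and el.text:
        yield el.text
    for child in el:
        yield from text_nodes(child)
        # Text after a child belongs to the parent (el), not to the child.
        if keep and child.tail:
            yield child.tail


# Hypothetical paragraph with one footnote marker.
p = ET.fromstring(
    '<p>Paris<sup class="reference">[1]</sup> is the capital of <b>France</b>.</p>'
)

result = "".join(text_nodes(p))
print(result)
```

Like the XPath, this only skips text held directly by an element whose own class is "reference"; text inside a differently classed child of such an element (for example a link inside the sup) would still be included.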
