XPath to href that contains a certain keyword in the link itself - Python

What I need is to find all the links on the page that have a certain keyword inside the link itself. So, based on some Stack Overflow topics, I built my XPath as follows:
response.xpath('//a[contains(@href,'/Best-Sellers-Health-Personal-Care')]')
which should return a link like "https://www.amazon.com/Best-Sellers-Health-Personal-Care-Tongue......"
But I get an invalid syntax error all the time. Where am I mistaken?
So what I do now is just add an "if contains" check when iterating through the list, but I hoped there was a more elegant and faster solution.

This is because of inconsistent quote usage.
Just replace
'//a[contains(@href,'/Best-Sellers-Health-Personal-Care')]'
with
'//a[contains(@href,"/Best-Sellers-Health-Personal-Care")]'
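For completeness, a minimal sketch of the corrected expression in use, assuming a Scrapy response object as in the question (the /@href step and the extract() call are additions for illustration):

# outer quotes single, inner quotes double, so the string parses cleanly
links = response.xpath('//a[contains(@href, "/Best-Sellers-Health-Personal-Care")]/@href').extract()
for href in links:
    print(href)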


I found a span on a website that is not visible and I can't scrape it! Why?

Currently I'm trying to scrape data from a website, and I'm using Selenium for that.
Everything was working as it should, until I realised I have to scrape a tooltip text.
I already found different threads on Stack Overflow that provide an answer; anyway, I did not manage to solve this issue so far.
After a few hours of frustration I realised the following:
this span has nothing to do with the tooltip, I guess, because the tooltip looks different (screenshot omitted).
There is actually a span that I can't read. I try to read it like this:
bewertung = driver.find_elements_by_xpath('//span[@class="a-icon-alt"]')
for item in bewertung:
    print(item.text)
So Selenium finds this element, but unfortunately .text returns nothing. Why is it always empty?
And what is the span from the first screenshot for? By the way, it is not displayed on the website either.
Since you've mentioned that Selenium finds this element, I would assume you must have printed the length of the bewertung list,
something like
print(len(bewertung))
If this list has some elements in it, you could probably use innerText:
bewertung = driver.find_elements_by_xpath('//span[@class="a-icon-alt"]')
for item in bewertung:
    print(item.get_attribute("innerText"))
Note that you are using find_elements, which won't throw any error; instead, if it does not find the element, it returns an empty list.
So if you use find_element instead, it would throw the exact error, as in the sketch below.
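A minimal sketch of that difference, reusing the selector from above:

from selenium.common.exceptions import NoSuchElementException

try:
    # find_element raises when nothing matches, unlike find_elements
    item = driver.find_element_by_xpath('//span[@class="a-icon-alt"]')
    print(item.get_attribute("innerText"))
except NoSuchElementException as error:
    print(error)  # the exact error that find_elements silently swallows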
Also, I think you have the XPath for a span which does not appear in the UI (sometimes such elements don't appear until some action is triggered).
You can try to use this XPath instead:
//i[@data-hook='average-stars-rating-anywhere']//span[@data-hook='acr-average-stars-rating-text']
Something like this in code:
bewertung = driver.find_elements_by_xpath("//i[@data-hook='average-stars-rating-anywhere']//span[@data-hook='acr-average-stars-rating-text']")
for item in bewertung:
    print(item.text)

Problems with XPath in Selenium

I want to find an element based on its attributes.
I have already tried searching through all divs, specifying the attributes, and even searching by *. None of this was a solution.
The whole element looks like this:
<div class="charc" data-lvl="66" data-world="walios" data-nick="mirek">
This is my search expression:
driver.find_element_by_xpath('//div[@data-world="walios"] and [@data-nick="mirek"]')
I would like to find this element using Python with Selenium, and be able to click on it.
Actually I am getting the error
SyntaxError: Failed to execute 'evaluate' on 'Document': The string '//div[@data-world="walios"] and [@data-nick="mirek"]' is not a valid XPath expression.
What am I doing wrong?
The error message is correct, because your predicates are not correct.
Try putting the conditions in one [...] predicate:
driver.find_element_by_xpath('//div[@data-world="walios" and @data-nick="mirek"]')
driver.find_elements_by_xpath("//div[@data-world='walios' and @data-nick='mirek']")
or
driver.find_elements_by_xpath("//div[@data-world='walios'][@data-nick='mirek']")
Multiple conditions for selecting a tag can't be within nested []s; you have to specify them either within one [] or within multiple consecutive []s.
XPath axes methods:
XPath axes are used to find complex or dynamic elements relative to a node you can already locate; a sketch follows below.
An XPath expression selects a node or list of nodes from the document on the basis of attributes like id, name, class name, etc.
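As an illustrative sketch only (the markup, class name, and text here are hypothetical, not from the question), an axis such as following-sibling steps from a node you can locate to a related one:

# Hypothetical markup: <div class="label">Level</div><span>66</span>
level = driver.find_element_by_xpath("//div[@class='label' and text()='Level']/following-sibling::span")
print(level.text)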

How to click links one by one to get data from a website using Selenium (Python)

I am trying to get data from the website, but I want to select the first 1000 links, open them one by one, and get data from there.
I have tried:
list_links = driver.find_elements_by_tag_name('a')
for i in list_links:
    print(i.get_attribute('href'))
Through this I am getting extra links which are not required.
For example:
https://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1,2,3,4,5,%3E5&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa,Residential-Plot&cityName=Mumbai
From this we will get more than 50k links. How do I open only the first 1000 links, the ones listed below with property photos?
Edit
I have tried this also:
driver.find_elements_by_xpath("//div[@class='.l-srp__results.flex__item']")
driver.find_element_by_css_selector('a').get_attribute('href')
for matches in driver:
    print('Liking')
    print(matches)
    #matches.click()
    time.sleep(5)
But I am getting the error: TypeError: 'WebDriver' object is not iterable.
Why am I not getting a link using this line: driver.find_element_by_css_selector('a').get_attribute('href')?
Edit 1
I am trying to filter the links as below, but I am getting an error:
result = re.findall(r'https://www.magicbricks.com/propertyDetails/', my_list)
print (result)
Error: TypeError: expected string or bytes-like object
Or I tried:
a = ['https://www.magicbricks.com/propertyDetails/']
output_names = [name for name in a if (name[:45] in my_list)]
print (output_names)
I am not getting anything.
All the links are in a list. Please suggest.
Thank you in advance.
Selenium is not a good idea for web scraping. I would suggest you use JMeter, which is free and open source.
http://www.testautomationguru.com/jmeter-how-to-do-web-scraping/
If you want to use Selenium, the approach you are trying to follow (clicking and grabbing the data) is not a stable one. Instead I would suggest something similar to the following. The example is in Java, but you can get the idea.
driver.get("https://www.yahoo.com");
Map<Integer, List<String>> map = driver.findElements(By.xpath("//*[@href]"))
        .stream()                              // find all elements which have a href attribute & process one by one
        .map(ele -> ele.getAttribute("href"))  // get the value of href
        .map(String::trim)                     // trim the text
        .distinct()                            // there could be duplicate links, so keep unique ones
        .collect(Collectors.groupingBy(LinkUtil::getResponseCode)); // group the links based on the response code (LinkUtil comes from the linked article)
More info is here.
http://www.testautomationguru.com/selenium-webdriver-how-to-find-broken-links-on-a-page/
I believe you should collect in a list all the elements with tag name "a" whose "href" attribute is not null.
Then traverse through the list and open the elements one by one.
Create a list of WebElements and store all the valid links in it.
Here you can apply more filters or conditions, e.g. the link contains certain characters or meets some other condition.
To store the WebElements in a list you can use driver.find_elements(); this method returns a list of WebElements. A rough sketch follows below.
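A Python sketch of that approach, assuming the Magicbricks page from the question; the 'propertyDetails' keyword and the 1000 cap come from the question, the rest is illustrative:

import time

anchors = driver.find_elements_by_tag_name('a')

# keep only non-empty hrefs that point at a property detail page
hrefs = []
for a in anchors:
    href = a.get_attribute('href')
    if href and 'propertyDetails' in href:
        hrefs.append(href)

# drop duplicates while preserving order, then cap at the first 1000
seen = set()
unique = [h for h in hrefs if not (h in seen or seen.add(h))]

for href in unique[:1000]:
    driver.get(href)  # navigating directly avoids stale-element issues from clicking
    time.sleep(2)     # crude wait; replace with explicit waits as needed
    # ... scrape the detail page here ...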

Python crawler not finding specific XPath

I asked my previous question here:
Xpath pulling number in table but nothing after next span
This worked and I managed to see the number I wanted in a Firefox plugin called XPath Checker (results shown in a screenshot in the original post).
So I know I can find this number with this XPath, but when trying to run a Python script to find and save the number, it says it cannot find it.
try:
    views = browser.find_element_by_xpath("//div[@class='video-details-inside']/table//span[@class='added-time']/preceding-sibling::text()")
except NoSuchElementException:
    print "NO views"
    views = 'n/a'
    pass
I know that pass is not best practice, but I am just testing this at the moment, trying to find the number. I'm wondering if I need to change something at the end of the XPath, like adding .text, as XPath Checker normally shows the results a little differently (screenshot omitted).
I needed to use the XPath I gave rather than the one used in that picture because I only want the number and not the date. You can see part of the source in my previous question.
Thanks in advance! Scratching my head here.
The XPath used in find_element_by_xpath() has to point to an element, not a text node and not an attribute. This is the critical thing here.
The easiest approach here would be to:
get the td's text (parent)
get the span's text (child)
remove the child's text from the parent's
Code:
span = browser.find_element_by_xpath("//div[@class='video-details-inside']/table//span[@class='added-time']")
td = span.find_element_by_xpath('..')
views = td.text.replace(span.text, '').strip()
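Here span.find_element_by_xpath('..') steps up to the span's parent td, so stripping the child's text out of the parent's text leaves exactly the text node the original expression was aiming for.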

Searching on class names with a space (\s) Python lxml

I wonder if anybody can help :)
I am using Python lxml and cssselect to scrape data from HTML pages.
I can select most classes with ease using this method and find it very convenient, but I am having a problem selecting class names that contain a space.
For example, I want to extract "Blah blah" from the following element:
<li class="feature height">Blah blah</li>
I have tried using the following CSS selectors without success (the whole path is not included, as that is not the problem):
li.feature.height
li.feature height
li.feature:height
Does anybody know how to do this? I can't find the answer and am sure it must be a fairly common thing people need to do...
I cannot just select the parent element
li.feature
as the data is not in the same order on different pages; the same applies to nth-element selections...
I've been scratching my head on this a while now and have searched a lot; I hope somebody knows!
I can work around it by getting the data using regexes, and that works, but I wonder if there is a simpler solution...
Thanks for your help in advance!
Matt
Extra information as requested: it doesn't work because it returns an empty list (or a negative result for a boolean test).
So if I use
css_9_seed_height = 'html body div.seedicons ul li.feature.height'
# 9. Get seed_height
seed_height_obj = root.cssselect(css_9_seed_height)
print seed_height_obj
this returns an empty list, i.e. the class is not found, but it is there.
You can assume that root.cssselect() works correctly, as I am retrieving lots of other information the same way.
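For what it's worth, one common workaround when cssselect misbehaves on multi-class attributes is to match the class token with XPath instead; a self-contained sketch (the sample markup mirrors the question):

from lxml import html

root = html.fromstring('<div class="seedicons"><ul><li class="feature height">Blah blah</li></ul></div>')

# match any li whose space-separated class list contains the token 'height'
matches = root.xpath("//li[contains(concat(' ', normalize-space(@class), ' '), ' height ')]")
for li in matches:
    print(li.text_content())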
