I am pretty new to Selenium and I am trying to figure out how to simulate an onclick.
This is what I see when I inspect the HTML source:
<img src="images/ListingOptionSearch.jpg" onmouseover="this.src='images/ListingOptionSearchHover.jpg'" onmouseout="this.src='images/ListingOptionSearch.jpg'">
I tried:
driver.find_element_by_css_selector("a[onlick*=document.getElementById('pN').selectedIndex]").click()
but I get an InvalidSelectorException.
Any ideas?
Thanks!
You can use
from selenium import webdriver
browser = webdriver.Chrome(somepath) # somepath: path to your chromedriver executable
browser.execute_script("document.getElementById('pN').selectedIndex = 0;document.getElementById('optionList').submit();return false")
This means you can execute JavaScript code just by using .execute_script. Awesome, right?
An InvalidSelectorException is raised when the selector itself cannot be parsed; a selector that is valid but matches nothing raises NoSuchElementException instead. From experience, when an element cannot be found it may also be inside an iframe, and you'll have to use .switch_to.frame before you can interact with it.
Also, I like using XPath (in my experience the most reliable option). It takes a little time to get used to, but an hour or two of practice is enough to pick it up.
JeffC has a good point: the structure of the HTML and JS can always change.
You can use find_element_by_xpath(xpath).click(), but there are also locators that depend less on the page structure, such as find_element_by_name. The available methods are:
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
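In Selenium 4 the find_element_by_* helpers listed above were deprecated and, as far as I know, removed in 4.3 in favour of find_element(By.X, value). The sketch below maps each helper name to the strategy string that the corresponding By constant holds. to_locator is a hypothetical helper written for illustration, not part of Selenium.

```python
# Strategy strings as defined on selenium.webdriver.common.by.By
# (By.ID == "id", By.CSS_SELECTOR == "css selector", and so on).
LEGACY_TO_STRATEGY = {
    "find_element_by_id": "id",
    "find_element_by_name": "name",
    "find_element_by_xpath": "xpath",
    "find_element_by_link_text": "link text",
    "find_element_by_partial_link_text": "partial link text",
    "find_element_by_tag_name": "tag name",
    "find_element_by_class_name": "class name",
    "find_element_by_css_selector": "css selector",
}


def to_locator(legacy_name, value):
    """Return a (strategy, value) tuple for the given legacy helper name."""
    return (LEGACY_TO_STRATEGY[legacy_name], value)


locator = to_locator("find_element_by_id", "pN")
print(locator)
# With a live driver this would be used as: driver.find_element(*locator)
```

Keeping locators as plain tuples also makes them easy to pass to explicit waits, which take the same (strategy, value) pair.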
Did you google the exception? It means that your selector is not valid. CSS selectors like the one you are trying to use are in the form tag[attribute='value']. You don't have the value surrounded by quotes... which is awkward in this specific case because your value already contains single quotes, so you would have to wrap it in escaped double quotes instead.
Because the A tag encloses the IMG tag, you can click on the IMG tag and get the same effect. A CSS selector like the below should work.
img[src='images/ListingOptionSearch.jpg']
There are other selectors that would probably work, but without a link to the page, etc., I would just be guessing as to whether they would be unique.
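To illustrate the quoting rule described above, here is a minimal sketch of a helper that builds an attribute selector and escapes any quotes inside the value. css_attr is a hypothetical name chosen for this example, not a Selenium function, and the onclick value is taken from the question's attempt rather than from the actual page.

```python
def css_attr(tag, attribute, value, operator="="):
    """Build tag[attribute<op>"value"], escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{tag}[{attribute}{operator}"{escaped}"]'


# The image from the question: an exact match on src.
img_selector = css_attr("img", "src", "images/ListingOptionSearch.jpg")

# A value containing single quotes is fine once it sits inside double quotes.
onclick_selector = css_attr(
    "a", "onclick", "document.getElementById('pN').selectedIndex", operator="*="
)

print(img_selector)
print(onclick_selector)
# With a live driver: driver.find_element_by_css_selector(img_selector).click()
```

Building the selector in one place avoids hand-mixing Python quotes and CSS quotes, which is where the original attempt went wrong.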
I am using python / selenium to archive some posts. They are simple text + images. As the site requires a login, I'm using selenium to access it.
The problem is, the page shows all the posts, and they are only fully readable on clicking a text labeled "read more", which brings up a popup with the full text / images.
So I'm writing a script to scroll the page, click read more, scrape the post, close it, and move on to the next one.
The problem I'm running into, is that each read more button is an identical element:
read more
If I try to loop through them using XPaths, I run into the problem of them being formatted differently as well, for example:
//*[@id="page"]/div[2]/article[10]/div[2]/ul/li/a
//*[@id="page"]/div[2]/article[14]/div[2]/p[3]/a
I tried formatting my loop to just loop through the article numbers, but of course the XPaths terminate differently. Is there a way I can add a wildcard to the back half of my XPaths? Or search just by the article numbers?
/ selects only direct children; use // instead to go from <article> down to the <a>, however deeply it is nested:
//*[@id="page"]/div[2]/article//a[.="read more"]
This will give you a list of elements you can iterate. You might be able to remove the [.="read more"], but it might catch unrelated <a> tags, depends on the rest of the html structure.
You can also try looking for the read more elements directly by text
//a[.="read more"]
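The difference between / and // can be checked offline with Python's standard-library xml.etree.ElementTree, which supports a small subset of XPath (the [.='text'] predicate needs Python 3.7 or later). The markup below is invented to mimic the two article layouts from the question, and because ElementTree only accepts relative paths, the expressions start with ./ from the page element instead of //*[@id="page"].

```python
import xml.etree.ElementTree as ET

# Invented markup: each article nests its "read more" link differently.
PAGE = """
<div id="page">
  <div>header</div>
  <div>
    <article><div>teaser</div><div><ul><li><a href="#post-1">read more</a></li></ul></div></article>
    <article><div>teaser</div><div><p>a</p><p>b</p><p><a href="#post-2">read more</a></p></div></article>
    <article><div><a href="#about">about</a></div></article>
  </div>
</div>
"""

page = ET.fromstring(PAGE.strip())

# "/" only looks at direct children of <article>, so nothing matches.
direct = page.findall("./div[2]/article/a")

# "//" searches all descendants of each <article>, whatever the nesting.
read_more = page.findall("./div[2]/article//a[.='read more']")

hrefs = [a.get("href") for a in read_more]
print(len(direct), hrefs)
```

In Selenium the same idea applies with the full expression from the answer, passed to find_elements so that you get a list to iterate over.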
I recommend using CSS selectors over XPaths. CSS selectors provide a faster, cleaner and simpler way to deal with these queries.
('a[href^="javascript"]')
This selects every <a> element whose href attribute value begins with "javascript", which is what you are looking for.
You can learn more about Locating Elements by CSS Selectors in selenium here.
readMore = driver.find_element(By.CSS_SELECTOR, 'a[href^="javascript"]')
You can also locate hyperlinks by their link text. Note that this matches the visible text of the link, not its href:
readMore_links = driver.find_elements(By.LINK_TEXT, 'read more')
I am trying to find the textbox element using the find_element_by_xpath() method, but it keeps telling me it can't find the element. I've also tried finding it by link text, partial link text and CSS selector, and it just doesn't work. Here's the line of code that does the lookup:
bar = nav.find_element_by_xpath('//*[@id="react-root"]/div/div/div[2]/main/div/div/div/div[2]/div/div/aside/div[2]/div[2]/div/div/div/div/div[1]/div/div/div/div[2]/div/div/div/div')
Thanks in advance!
So, I suggest writing your own XPath if you want to be precise, rather than copying one that is based on the HTML structure (which can change).
You can locate the element with this XPath:
//input[@placeholder='Search people' and @role='combobox']
To avoid this problem, I suggest going through an XPath tutorial for a better understanding of how to create custom locators.
What was happening: when I opened the tab to inspect the element, the DOM structure changed because of my screen size, so the XPath was no longer the same.
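The effect described above can be reproduced offline. This is a minimal sketch using the standard-library xml.etree.ElementTree on two invented layouts of the same page; ElementTree has no "and" operator, so the two attribute tests from the suggested XPath are written as chained predicates, which select the same elements here.

```python
import xml.etree.ElementTree as ET

# Two invented layouts of the same page: the search box sits at different depths.
WIDE = """
<main><div><aside><div><div>
  <input placeholder="Search people" role="combobox"/>
</div></div></aside></div></main>
"""
NARROW = """
<main><div><div>
  <input placeholder="Search people" role="combobox"/>
</div></div></main>
"""

# Attribute-based locator, independent of where the input is nested.
BY_ATTRIBUTES = ".//input[@placeholder='Search people'][@role='combobox']"
# Structure-based path as it would be copied from the wide layout.
BY_STRUCTURE = "./div/aside/div/div/input"


def count_matches(markup, path):
    """Parse the markup and count the elements that the path selects."""
    return len(ET.fromstring(markup.strip()).findall(path))


for name, markup in (("wide", WIDE), ("narrow", NARROW)):
    print(name, count_matches(markup, BY_ATTRIBUTES), count_matches(markup, BY_STRUCTURE))
```

The structural path only matches the layout it was copied from, while the attribute-based one finds the input in both.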
I'm looking to write a program for a repetitive action that we would otherwise have to do by hand about 1000 times.
This action is done through a web browser (I'm using Chrome). My actual issue is that the XPath selector changes at every connection, but only by one number. I locate the element on the web page using Selenium and the associated WebDrivers.
My code does run sometimes, when the selector happens to have the right name.
Since the selector changes constantly, it occasionally matches the one I hard-coded!
So, after starting a headless browser and logging in to the company web page, I have to locate and then click a specific object on the page:
The problematic code is the following:
wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="__xmlview0--settingsButton-img"]')))
OT = driver.find_element_by_xpath('//*[@id="__xmlview0--settingsButton-img"]')
OT.click()
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#__select1-label')))
driver.save_screenshot("screenshot.png")
I have an idea but I don't know how to implement it: is it possible to match any number in place of the 0 in xmlview0, which is the part of the selector that changes?
I'm not a Python veteran and I really don't want to do the job by hand.
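One common way to handle an id whose number changes is to match only its stable parts. The sketch below assumes that the prefix __xmlview and the suffix --settingsButton-img stay constant between sessions, which the question suggests but does not prove. The helper names are made up for this example.

```python
def css_for_dynamic_id(prefix, suffix):
    """CSS: id starts with prefix (^=) and ends with suffix ($=)."""
    return f'[id^="{prefix}"][id$="{suffix}"]'


def xpath_for_dynamic_id(prefix, suffix):
    """XPath 1.0 has starts-with() and contains(), but no ends-with()."""
    return f'//*[starts-with(@id, "{prefix}") and contains(@id, "{suffix}")]'


def id_would_match(element_id, prefix, suffix):
    """Plain-Python mirror of the CSS rule, handy for checking ids offline."""
    return element_id.startswith(prefix) and element_id.endswith(suffix)


PREFIX, SUFFIX = "__xmlview", "--settingsButton-img"
css = css_for_dynamic_id(PREFIX, SUFFIX)
xpath = xpath_for_dynamic_id(PREFIX, SUFFIX)
print(css)
print(xpath)

# With a live driver, the wait from the question would become:
# wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, css)))
```

If several elements share the same prefix and suffix, the selector would need an extra stable attribute or ancestor to stay unique.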
This is a problem I always have when getting a specific XPath from my browser.
Assume that I want to extract all the images from some website like Google Image Search or Pinterest. When I use Inspect element and then Copy XPath to get the XPath for an image, it gives me something like the following:
//*[@id="rg_s"]/div[13]/a/img
I got this from an image in Google Search. When I want to use it in my spider, I use Selector and HtmlXPathSelector with the following XPaths, but none of them work!
//*[@id="rg_s"]/div/a/img
//div[@id="rg_s"]/div[13]/a/img
//[@class="rg_di rg_el"]/a/img #i change this based on the raw html of page
#hxs.select(xpath).extract()
#Selector(response).xpath('xpath')
.
.
I've read many questions, but I couldn't find a general answer to how I can use XPaths obtained from a web browser in Scrapy.
Usually it is not safe or reliable to blindly follow the browser's suggestion about how to locate an element.
First of all, the XPath expressions that developer tools generate are usually absolute, starting from the parent of all parents, the html tag, which makes them more dependent on the page structure (well, Firebug can also make expressions based on id attributes).
Also, the HTML code you see in the browser can be quite different from what Scrapy receives, due to the asynchronous nature of the page load and JavaScript being dynamically executed in the browser. Scrapy is not a browser and "sees" only the initial HTML code of a page, before the "dynamic" part.
Instead, inspect what Scrapy really has in the response: open up the Scrapy Shell, inspect the response and debug your XPath expressions and CSS selectors:
$ scrapy shell https://google.com
>>> response.xpath('//div[@id="myid"]')
...
Here is what I've got for the Google image search:
$ scrapy shell "https://www.google.com/search?q=test&tbm=isch&qscrl=1"
In [1]: response.xpath('//*[@id="ires"]//img/@src').extract()
Out[1]:
[u'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRO9ZkSuDqt0-CRhLrWhHAyeyt41Z5I8WhOhTkGCvjiHmRiTSvDBfHKYjx_',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQpwyzbW_qsRenDw3d4wwpwwm8n99ukMtLCVaPiTJxyviyQVBQeRCglVaY',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSrxtoY3-3QHwhjc5Ofx8090uDYI8VOUbi3gUrd9USxZ-Vb1D5pAbOzJLMS',
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTQO1A3dDJ07tIaFMHlXNOsOnpiY_srvHKJE1xOpsMZscjL3aKGxaGLOgru',
u'https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQ71ukeTGCPLuClWd6MetTtQ0-0mwzo3rn1ug0MUnbpXmKnwNuuBnSWXHU',
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRZmWrYR9A4W97jpjhtIbyUM5Lj3vRL0vgCKG_xfylc5wKFAk6UB8jiiKA',
...
u'https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcRj08jK8sBjX90Tu1RO4BfZkKe5A59U0g1TpMWPFZlNnA70SQ5i5DMJkvV0']
The XPath generated from an insertion point in a browser is bound to be brittle because there are many different possible XPath expressions to reach any given node, JavaScript can modify the HTML, and the browser doesn't know your intentions.
For the example you gave,
//*[@id="rg_s"]/div[13]/a/img
the 13th div is particularly prone to breakage.
Try instead to find a uniquely identifying characteristic closer to your target. A unique @id attribute would be ideal, or a @class that uniquely identifies your target or a close ancestor of your target can work well too.
For example, for Google Image Search, something like the following XPath
//div[@id='rg_s']//img[@class='rg_i']
will select all images of class rg_i within the div containing the search results.
If you're willing to abandon the copy-and-paste approach and learn enough XPath to generalize your selections, you'll get much better results. Of course, standard disclaimers apply about changes to presentation necessitating updating of scraping techniques too. Using a direct API call would be much more robust (and proper as well).
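To make the brittleness of positional steps concrete, here is a small sketch using the standard-library xml.etree.ElementTree on an invented stand-in for a results page. It is not Google's real markup, and Scrapy's response.xpath supports far more of XPath than ElementTree does; the ids and class names are simply borrowed from the expressions above.

```python
import xml.etree.ElementTree as ET


def results_page(extra_banner):
    """Invented results page; an optional banner shifts every index by one."""
    banner = "<div><a><img class='ad' src='ad.png'/></a></div>" if extra_banner else ""
    tiles = "".join(
        f"<div><a><img class='rg_i' src='img{i}.jpg'/></a></div>" for i in range(1, 4)
    )
    return ET.fromstring(f"<body><div id='rg_s'>{banner}{tiles}</div></body>")


POSITIONAL = ".//div[@id='rg_s']/div[2]/a/img"       # copy-paste style
BY_CLASS = ".//div[@id='rg_s']//img[@class='rg_i']"  # anchored on id and class


def sources(page, path):
    """Return the src attribute of every image the path selects."""
    return [img.get("src") for img in page.findall(path)]


for banner in (False, True):
    page = results_page(banner)
    print(banner, sources(page, POSITIONAL), sources(page, BY_CLASS))
```

Once the banner is inserted, the positional path silently returns a different image, while the class-based path keeps returning the same three.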