Selenium Python: clicking links produced by JSON application

[ Ed: Maybe I'm just asking this? Not sure -- Capture JSON response through Selenium ]
I'm trying to use Selenium (Python) to navigate via hyperlinks to pages in a web database. One page returns a table with hyperlinks that I want Selenium to follow. But the links do not appear in the page's source. The only HTML that corresponds to the table of interest is a <div> tag indicating that the site is pulling results from a facet search. Within the div is a <script type="application/json"> tag and a handful of search options. Nothing else.
Again, I can view the hyperlinks in Firefox, but not using "View Page Source" or Selenium's selenium.webdriver.Firefox().page_source call. Instead, that call outputs not the <script> tag but a series of <div> tags that appear to define the results' format.
Is Selenium unable to navigate output from JSON applications? Or is there another way to capture the output of such applications? Thanks, and apologies for the lack of code/reproducibility.

Try using execute_script() and get the links by running JavaScript, something like:
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")
Note: if the divs are generated dynamically by scripts, you may want to wait a few seconds before executing the script.
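A minimal sketch of that approach, assuming the div#your-link-to-follow selector from the snippet above and using an explicit wait rather than a fixed sleep (the URL is a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://example.com/results')  # placeholder URL

# wait until the dynamically generated div is present, then click it via JavaScript
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div#your-link-to-follow')))
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")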

I've confronted a similar situation on a website with JavaScript (http://ledextract.ces.census.gov to be specific). I had pretty good luck just using Selenium's find_element methods. The key is that even if not everything about the hyperlinks appears in the page's source, Selenium can usually still find them after navigating to the website, since doing so runs the JavaScript that produces the additional links.
Thus, for example, you could try mousing over the links, finding their titles, and then using:
driver.find_element_by_xpath("//*[@title='Link Title']").click()
Based on whatever title appears by the link when you mouse over it.
Or, you may be able to find the links based on the text that appears on them:
driver.find_element_by_partial_link_text('Link Text').click()
Or, if you have a sense of the id for the links, you could use:
driver.find_element_by_id('Link_ID').click()
If you are at a loss for what the text, title, ID, etc. would be for the links you want, a somewhat blunt approach is to pull the id, text, and title of every element on the page and save them to a file you can search through to identify likely candidates for the links you want. That will show you a lot more (in some respects) than the site's source code would:
AllElements = driver.find_elements_by_xpath('//*')
for Element in AllElements:
    print('ID = %s TEXT = %s Title = %s' % (Element.get_attribute("id"), Element.get_attribute("text"), Element.get_attribute("title")))
Note: if you have, or suspect you have, a situation where multiple links share the same title/text, then you may want to use the find_elements (plural) methods to get a list of all elements satisfying your criteria, specify the XPath more explicitly, etc.
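For example, a minimal sketch of the plural form ('Link Text' is a placeholder):

matches = driver.find_elements_by_partial_link_text('Link Text')
for match in matches:
    print('HREF = %s TITLE = %s' % (match.get_attribute('href'), match.get_attribute('title')))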

Related

Python/Selenium: Any way to wildcard the end of an xpath? Or search for a specifically formatted piece of an xpath?

I am using Python / Selenium to archive some posts. They are simple text + images. As the site requires a login, I'm using Selenium to access it.
The problem is, the page shows all the posts, but they are only fully readable on clicking a link labeled "read more", which brings up a popup with the full text / images.
So I'm writing a script to scroll the page, click read more, scrape the post, close it, and move on to the next one.
The problem I'm running into is that each read more button is an identical element:
<a href="javascript:...">read more</a>
If I try to loop through them using XPaths, I run into the problem of them being formatted differently as well, for example:
//*[#id="page"]/div[2]/article[10]/div[2]/ul/li/a
//*[#id="page"]/div[2]/article[14]/div[2]/p[3]/a
I tried formatting my loop to just loop through the article numbers, but of course the XPaths terminate differently. Is there a way I can add a wildcard to the back half of my XPaths? Or search just by the article numbers?
/ selects a direct child; use // instead to go from the <article> to the <a>:
//*[#id="page"]/div[2]/article//a[.="read more"]
This will give you a list of elements you can iterate over. You might be able to remove the [.="read more"] part, but then it might catch unrelated <a> tags, depending on the rest of the HTML structure.
You can also try looking for the read more elements directly by text
//a[.="read more"]
I recommend using CSS selectors over XPaths. CSS selectors provide a faster, cleaner, and simpler way to handle these queries.
a[href^="javascript"]
This selects every <a> element whose href attribute value begins with "javascript", which is what you are looking for.
You can learn more about locating elements by CSS selectors in the Selenium documentation.
readMore = driver.find_element(By.CSS_SELECTOR, 'a[href^="javascript"]')
And about locating hyperlinks by link text (note that the link text is the visible text, 'read more', not the href value):
readMore_links = driver.find_elements(By.LINK_TEXT, 'read more')
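A runnable sketch of the CSS-selector approach (the By import is required for this locator style; the URL is a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://example.com/posts')  # placeholder URL

for readMore in driver.find_elements(By.CSS_SELECTOR, 'a[href^="javascript"]'):
    print(readMore.text)  # should print 'read more' for each matching link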

Get current HTML from browser tab with Python

I know there are plenty of ways to get the HTML source given a page URL.
But is there a way to get the current HTML of a page if it displays data only after some action?
For example: a simple HTML page with a button (that's the source HTML) that displays random data when you click it.
Thanks
I believe you're looking for a tool collectively known as a "headless browser". The only one available in Python that I've used (and can vouch for) is Selenium WebDriver, but there are plenty to choose from if you search for headless browsers for Python.
https://pypi.org/project/selenium
With this you should be able to programmatically load a web page, look up and click the button in the virtually rendered DOM, then look up the innerHTML property of the targeted element.
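A minimal sketch of that flow; the URL and button id are placeholders, not part of any real page:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://example.com/page-with-button')  # placeholder URL

driver.find_element(By.ID, 'the-button').click()  # hypothetical button id
# page_source now reflects the DOM after the click, including the new data
print(driver.page_source)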

Selenium scraping with HTML changing after refresh

I am using Selenium along with Python to scrape some pages. I have many web pages that represent the same type of object (football player information), but each of them has a slightly different HTML layout. In particular, my main issue is that the div class identifiers change when refreshing or changing pages, in a way that is unpredictable.
In the specific case, I would like to get the data in the div whose class identifier is "jss176", but when I get to another player this will change to "jss450", for example, with no meaningful pattern to be found.
Is there a way I can get around this? I was thinking of navigating through the children starting from the div with id="root", but I can't seem to find a good piece of code to achieve this.
Thank you very much!
If only the class names change, but not the page structure, you can scrape the info by XPath.
https://www.tutorialspoint.com/what-is-xpath-in-selenium-with-python
You can directly access the div you want: right-click the element in Chrome's inspector and select the "Copy XPath" option.
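For example, a structural XPath that never mentions the generated jss class; the exact path below is hypothetical, so use the one Chrome copies for you:

player_div = driver.find_element_by_xpath('//div[@id="root"]/div/div[2]')  # hypothetical position under #root
print(player_div.text)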

How to scrape source code of a page without clicking on expand button?

This particular website has a 'show more' button to load more data into a table. But this data seems to be loaded at the start, because I can click the button and expand the table even in offline mode.
Is there a way to scrape the whole source code in one go, without clicking this button many times over in Selenium? It seems the entire table is loaded initially when the page is first loaded.
driver.page_source does not show the whole thing in this case, only what is visibly showing when the browser opens.
Using Python, Selenium with Google Chrome.
If indeed all the data is loaded at the start, then it can surely be found by looking at the DOM (at the tag containing the data). An easy way to do that is to open the developer tools (F12) and use the inspect element tool provided by your browser.
Now to answer your question: I'm going to scrape the data using BeautifulSoup at the found location (tag). I've seen that scraping with Selenium is pretty similar to BeautifulSoup, so you should get the concept.
For example, say your table resides in a div (having random attributes, let's say a class called 'randomclass'). The table tag is 'ul', and each entry is stored in an 'li', or more specifically in the text of each 'li'.
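First, feed the rendered page into BeautifulSoup (a minimal sketch; it assumes a Selenium driver with the page already loaded):

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')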
To select the div:
selected_div = soup.find('div', attrs={'class': 'randomclass'})
To select the table inside the div:
table = selected_div.find('ul')
To iterate through the table rows and manage data:
mylist = []
for li in table.find_all('li'):
    mylist.append(li.text)

How to scrape HTML rendered by JavaScript

I need to write an automated scraper that can handle websites that are rendered by JavaScript (like YouTube), or that simply use some JavaScript somewhere in their HTML to generate content (like generating a copyright year). Downloading their HTML source therefore makes no sense, as it won't be the final code (what users will see).
I use Python with Selenium and WebDriver, so that I can execute JavaScript on a given website. My code for that purpose is:
def execute_javascript_on_website(self, js_command):
    driver = webdriver.Firefox(firefox_options=self.webdriver_options, executable_path=os.path.dirname(os.path.abspath(__file__)) + '/executables/geckodriver')
    driver.get(self.url)
    try:
        return driver.execute_script(js_command)
    except Exception as exception_message:
        pass
    finally:
        driver.close()
Where js_command = "return document.documentElement.outerHTML;".
With this code I'm able to get the source code, but not the rendered one. I can do js_command = "return document;" (as I would in the console), but then I get a <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="5a784804-f623-3041-9840-03f13ce83f53", element="585b43a1-f3b2-1e4a-b348-4ddaf2944550")> object that holds the HTML but doesn't let me get it out.
Does anyone know about the way how to get HTML rendered by JavaScript (ideally in form of string), using Selenium? Or some other technique that would do it?
PS: I also tried WebDriverWait, but it didn't help; I still got HTML with unrendered JavaScript.
PPS: I need to get the whole HTML code (the whole html tag) with the JavaScript rendered in it (as it appears, for example, in the browser's inspector), or at least the DOM of the website in which the JavaScript has already been rendered.
driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
I've looked into it and I have to admit that the JavaScript in @Rumpelstiltskin Koriat's answer works. The current year is present in the returned HTML string, placed after the script tag (which, as @pguardiario mentioned, has to be there, since it's just an HTML tag). I've also found that in this case of simple JavaScript code in script tags, WebDriverWait is not even needed to obtain the HTML string with the rendered JavaScript code. Apparently I had somehow managed to overlook the rendered string I was so eagerly looking for.
What I've also found (as @Corey Goldberg suggested) is that Selenium's own methods work just as well, while looking better than the pure JavaScript line: driver.find_element_by_tag_name('html').get_attribute('innerHTML'). It returns a string, not a webelement.
On the other hand, when there is a need to scrape the whole HTML of an Angular-powered website, it is necessary (at least in the case of the YouTube website) to locate its tag with id="content" (and then prefix all XPaths used later in the code with that location, as if it were the whole HTML) or some tag inside that one. WebDriverWait was not needed here either.
But when locating just the html tag, the yt-app tag, or any other tag outside the one with id="content", HTML with unrendered JavaScript is returned. HTML in Angular-generated websites is mixed with Angular's own tags (which browsers apparently ignore).
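A minimal sketch of that workaround; the id="content" anchor comes from the YouTube case described above, so verify it in your own page's inspector first:

whole_html = driver.find_element_by_tag_name('html').get_attribute('innerHTML')  # may still contain unrendered Angular tags
rendered_html = driver.find_element_by_id('content').get_attribute('innerHTML')  # rendered subtree under id="content"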
