Capturing html source as text and then searching with xpath

Using Selenium in Python.
I have a scraping tool that opens a site, grabs text elements using XPath, cleans those elements, and closes the page. The tool is getting too bulky to clean the elements while the driver is still connected. So instead I want to open the page, grab the entire HTML source, close the page, and then pull what I need out of the source using XPath expressions. But since the page is now just text, I can't use Selenium's XPath methods on it. Any recommendations?
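One possible approach, sketched under the assumption that the third-party lxml package is installed: save driver.page_source to a string, quit the driver, and run the same XPath expressions against the string with lxml. The sample markup, the class names, and the helper name extract_texts below are made up for illustration; in the real tool, page_source would be the string captured from the driver before calling driver.quit().

```python
from lxml import html  # third-party: pip install lxml


def extract_texts(page_source: str, xpath: str) -> list[str]:
    """Parse an HTML string and return the stripped text of every element
    matched by the XPath expression (the expression must select elements,
    not attributes or text nodes)."""
    tree = html.fromstring(page_source)
    return [node.text_content().strip() for node in tree.xpath(xpath)]


# Stand-in for the string you would get from driver.page_source;
# this markup is hypothetical.
page_source = """
<html><body>
  <div class="item"><span class="name"> Widget </span></div>
  <div class="item"><span class="name">Gadget</span></div>
</body></html>
"""

names = extract_texts(page_source, "//div[@class='item']/span[@class='name']")
print(names)
```

Because the parsing happens on a plain string, the cleaning step no longer needs a live browser session.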

Related

How to extract embedded link from a webpage, having no iframe and not showing anything on the network tab...?

This Image shows my problem
In the above image, the link inside the tag is the clickable link; it triggers a prompt to download the pdf file whose actual source link is https://lms.nust.edu.pk/portal/pluginfile.php/1504453/mod_resource/content/0/APG-Mutual-Evaluation-Report-Pakistan-October%202019.pdf
I am using Selenium to find the links specified by XPath like this
bigger_tag = driver.find_elements(By.XPATH, "//div[@class='activityinstance']//a[@class='aalink'][contains(@href, 'https://lms.nust.edu.pk/portal/mod/resource/view.php?') or contains(@href, 'https://lms.nust.edu.pk/portal/mod/url/view.php')]")
How do I extract such links from the webpage?
The site I am trying to scrape is protected and requires login credentials, so sharing the code would be fruitless here. I just want to know the standard procedure when you can't find embedded links in the developer tools: no iframe, no server request visible in the Network tab, nothing.

Why does a list appear as a comment with Python Beautiful Soup?

I am trying to scrape the addresses of Dunkin' locations using this website: https://www.dunkindonuts.com/en/locations?location=10001. However, when I try to access the list of Dunkin' stores on the web page, it shows up as a comment. How do I access the list? I've never done web scraping before.
Here's my current code. I'm expecting a list of Dunkin' stores from which I can then extract the addresses.

requests.get() will return the raw HTML for a web page. This is only the beginning of the journey when you view this page in the browser. Your browser will parse that HTML to create the DOM. It will load other resources, such as images and scripts from other files. Then it will execute those scripts. In the modern web, those scripts will modify the DOM to give the page that you finally see in the browser. requests alone doesn't give you all that.
One solution is to use a library that loads the HTML into a browser and does all of the magic. selenium is one such library.
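To check what the raw response actually contains, the comment bodies can be pulled out of the HTML text and inspected. The following is only a sketch using the standard library's html.parser; the markup, the class name CommentCollector, and the store address are invented for illustration. Whether a comment holds real data or just an empty template that JavaScript fills in later depends on the page, and in the second case a browser-driven tool such as Selenium is still needed.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the body of every <!-- ... --> comment in an HTML string."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between "<!--" and "-->"
        self.comments.append(data.strip())


# Made-up markup standing in for the text of a requests.get() response.
raw_html = """
<ul id="stores">
  <!-- <li class="store">123 Example St</li> -->
</ul>
"""

collector = CommentCollector()
collector.feed(raw_html)
collector.close()
print(collector.comments)
```

Each collected string can then be parsed again as HTML if it turns out to contain usable markup.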

Get current HTML from browser tab with Python

I know there are plenty of ways to get the HTML source of a page by passing its URL.
But is there a way to get the current HTML of a page if it displays data only after some action?
For example: a simple HTML page with a button (that's the source HTML) that displays random data when you click it.
Thanks

I believe you're looking for a tool collectively known as a "headless browser". The only one I've used that is available in Python (and can vouch for) is Selenium WebDriver, but there are plenty to choose from if you're searching up headless browsers for Python.
https://pypi.org/project/selenium
With this you should be able to programmatically load a web page, look up and click the button in the virtually rendered DOM, then read the innerHTML property of the targeted element.

How to use Python to scrape all the table contents on this website which is written by AJAX?

https://www.fedsdatacenter.com/federal-pay-rates/index.php?y=2017&n=&l=&a=&o=
This website seems to be built with jQuery (AJAX). I would like to scrape the tables on all pages. When I inspect the 1, 2, 3, 4 page tags, they do not have a specific href link. Besides, clicking on them does not produce a clear pattern of GET requests, so I find it hard to use Python's urllib to send a GET request for each page.

You can use Selenium with Python (http://selenium-python.readthedocs.io/) to navigate through the pages. I would find the Next button, .click() it, then time.sleep(seconds) and scrape the page. Unfortunately, I can't navigate to the last page on this site (it seems broken, which you should also be aware of), but I'm assuming the Next button disappears or something similar when you reach the last page. Either way, you might want to save what you've scraped every time you go to a new page, so that you don't lose your data in the event of an error.
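The click, sleep, scrape, save loop could be sketched roughly as follows. To keep the sketch independent of any particular site, the Selenium calls are hidden behind two hypothetical callbacks: read_rows would wrap something like driver.find_elements(...) on the table rows, and go_to_next would locate and click the Next button, returning False when it is missing. The function name, the JSON-lines output format, and the max_pages guard are assumptions, not part of the original answer.

```python
import json
import time
from typing import Callable, Iterable


def scrape_all_pages(
    read_rows: Callable[[], Iterable[list]],
    go_to_next: Callable[[], bool],
    out_path: str,
    pause: float = 2.0,
    max_pages: int = 1000,
) -> int:
    """Append each page's rows to out_path as JSON lines; return pages scraped."""
    pages = 0
    with open(out_path, "a", encoding="utf-8") as out:
        while pages < max_pages:
            for row in read_rows():
                out.write(json.dumps(row) + "\n")
            out.flush()  # hand this page's rows to the OS before the next click
            pages += 1
            if not go_to_next():  # no Next button (or the click failed): stop
                break
            time.sleep(pause)  # crude wait for the AJAX table to reload
    return pages
```

Writing after every page means a crash on page 40 still leaves the first 39 pages in the file. The max_pages cap is there because the last page of the site reportedly misbehaves, so the loop should not rely on the Next button disappearing.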

Selenium Python: clicking links produced by JSON application

[ Ed: Maybe I'm just asking this? Not sure -- Capture JSON response through Selenium ]
I'm trying to use Selenium (Python) to navigate via hyperlinks to pages in a web database. One page returns a table with hyperlinks that I want Selenium to follow. But the links do not appear in the page's source. The only HTML that corresponds to the table of interest is a <div> tag indicating that the site is pulling results from a facet search. Within the div is a <script type="application/json"> tag and a handful of search options. Nothing else.
Again, I can view the hyperlinks in Firefox, but not using "View Page Source" or Selenium's selenium.webdriver.Firefox().page_source call. Instead, that call outputs not the <script> tag but a series of <div> tags that appear to define the results' format.
Is Selenium unable to navigate output from JSON applications? Or is there another way to capture the output of such applications? Thanks, and apologies for the lack of code/reproducibility.

Try using execute_script() and get the links by running JavaScript, something like:
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")
Note: if the divs are generated dynamically by scripts, you may want to wait a few seconds (for example with an implicit wait) before executing the script.

I've confronted a similar situation on a website with JavaScript (http://ledextract.ces.census.gov to be specific). I had pretty good luck just using Selenium's find_element() methods. The key is that even if not everything about the hyperlinks appears in the page's source, Selenium will usually be able to find it once it has navigated to the website, since doing that runs the JavaScript that produces the additional links.
Thus, for example, you could try mousing over the links, finding their titles, and then using:
driver.find_element(By.XPATH, "//*[@title='Link Title']").click()
Based on whatever title appears by the link when you mouse over it.
Or, you may be able to find the links based on the text that appears on them:
driver.find_element(By.PARTIAL_LINK_TEXT, 'Link Text').click()
Or, if you have a sense of the id for the links, you could use:
driver.find_element(By.ID, 'Link_ID').click()
If you are at a loss for what the text, title, ID, etc. would be for the links you want, a somewhat blunt approach is to pull the id, text, and title for every element on the website and save them to a file that you can look through to identify likely candidates for the links you want. That should show you a lot more (in some respects) than just the source code for the site would:
from selenium.webdriver.common.by import By

all_elements = driver.find_elements(By.XPATH, '//*')
for element in all_elements:
    print('ID = %s TEXT = %s Title = %s' % (element.get_attribute('id'), element.text, element.get_attribute('title')))
Note: if you have or suspect you have a situation where you'll have multiple links with the same title/text, etc. then you may want to use the find_elements (plural) methods to get lists of all those satisfying your criteria, specify the xpath more explicitly, etc.
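The "save that to a file" step mentioned above might look like the sketch below. It assumes only that each element exposes Selenium's WebElement interface (get_attribute() and .text); the function name dump_element_summary, the CSV layout, and the choice to skip elements with no id, text, or title are illustrative assumptions.

```python
import csv


def dump_element_summary(elements, out_path):
    """Write the id, visible text, and title of each element to a CSV file.

    Elements where all three values are empty are skipped to keep the
    file readable. Returns the number of data rows written.
    """
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["id", "text", "title"])
        for el in elements:
            row = [
                el.get_attribute("id") or "",
                (el.text or "").strip(),
                el.get_attribute("title") or "",
            ]
            if any(row):
                writer.writerow(row)
                written += 1
    return written
```

With a live session it could be called as dump_element_summary(driver.find_elements(By.XPATH, "//*"), "elements.csv"), after which the CSV can be searched for likely link candidates.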
