Get current HTML from browser tab with Python

I know there are plenty of ways to get a page's HTML source given its URL.
But is there a way to get the current HTML of a page if it displays data after some action?
For example: a simple HTML page with a button (that's the source HTML) that displays random data when you click it.
Thanks

I believe you're looking for a tool collectively known as a "headless browser". The only one I've used that is available in Python (and can vouch for) is Selenium WebDriver, but there are plenty to choose from if you search for headless browsers for Python.
https://pypi.org/project/selenium
With this you should be able to programmatically load a web page, look up and click the button in the virtually rendered DOM, then look up the innerHTML property of the targeted element.
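A minimal sketch of that flow with Selenium (using the current locator API; the URL, button id, and output id are hypothetical stand-ins, and a Chrome driver is assumed to be available):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com")                # hypothetical page with the button
    driver.find_element(By.ID, "load-data").click()  # hypothetical button id
    # Read the markup the click produced, either from one element...
    output = driver.find_element(By.ID, "output")    # hypothetical target id
    print(output.get_attribute("innerHTML"))
    # ...or from the whole rendered page.
    print(driver.page_source)
finally:
    driver.quit()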

Related

Capturing html source as text and then searching with xpath

Using Selenium in Python.
I have a scraping tool that opens a site, grabs text elements using XPath, cleans those elements, and closes the page. The tool is getting too bulky to clean the elements while the driver is still connected. So instead I want to open the page, grab the entire HTML source, close the page, and then pull what I want out of it using XPath. But since the page is now just text, I can't use Selenium's XPath methods. Any recommendations?
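One way to do that (my suggestion, not from the original thread) is to parse the saved source with lxml, which accepts the same kind of XPath expressions offline; a minimal sketch with a hypothetical URL and XPath:
from lxml import html
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")  # hypothetical target page
source = driver.page_source        # grab everything while still connected
driver.quit()                      # close the driver right away

# Run the same kind of XPath queries against the plain text.
tree = html.fromstring(source)
texts = tree.xpath("//div[@class='item']/text()")  # hypothetical XPath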

How to scrape hidden website using python :: style display:none

I was trying to scrape a website and I faced a problem: the data on the website is hidden, and when I clicked the "+" sign it showed the result.
How do I scrape this data using Python?
<tr class="ob_gDGC" style="display: none;">
The style only denotes what the screen displays, not what the document contains, so display:none doesn't restrict you from accessing the data.
However, if the data you are trying to access is not in the DOM, then you have a problem. View the page in dev tools to see whether the data is there before you click the button. If clicking the button appends children (or the DOM node flashes in Google Chrome's dev tools), then the website you are trying to scrape uses JavaScript DOM manipulation, and this is difficult or impossible to extract with the requests library. For that you would want a package like pyppeteer (or equivalent). With that you could load the web page, simulate the click event on "the plus sign", and then extract your required data.
I would advise you to edit your post to be a bit clearer and add an example of the DOM you are trying to scrape.
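A minimal pyppeteer sketch of that click-then-extract flow (the URL and the selector for the "+" sign are hypothetical):
import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()                # headless Chromium
    page = await browser.newPage()
    await page.goto("https://example.com")  # hypothetical URL
    await page.click("a.expand")            # hypothetical "+" selector
    content = await page.content()          # the DOM after the click
    await browser.close()
    return content

html = asyncio.get_event_loop().run_until_complete(main())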

Access widget window beautifulsoup python mechanize

I am trying to scrape information off websites like this:
https://www.glassdoor.com/Overview/Working-at-7-Eleven-EI_IE3581.11,19.htm
using Python + BeautifulSoup + mechanize.
Accessing anything on the main site is no problem. However, I also need the information that appears in an overlay window when one clicks on the "Rating Trends" button next to the bar with stars.
This overlay-window can also be accessed directly by using the url:
https://www.glassdoor.com/Reviews/7-Eleven-Reviews-E3581.htm#trends-overallRating
The HTML associated with this page is a modification of the original site's HTML.
However, regardless of what element I try to find (via findAll) on that overlay-window URL, BeautifulSoup returns zero hits.
How can I fix this? I tried adding a sleep between requesting the page and reading anything in, to no avail.
Thanks!
If you're using the Chrome browser, right-click the background of that page (without the additional information displayed) and choose 'Inspect' from the context menu (on Windows, anyway), then open the 'Network' tab so that you can see network traffic. Now click on 'Rating Trends'. The entry marked 'xhr' will be https://www.glassdoor.ca/api/employer/3581-rating.htm?locationStr=&jobTitleStr=&filterCurrentEmployee=false&filterEmploymentStatus=REGULAR&filterEmploymentStatus=PART_TIME (I very much hope!) and its contents will be the following.
{"employerId":3581,"ratings":[{"hasRating":true,"type":"overallRating","value":2.9},{"hasRating":true,"type":"ceoRating","value":0.54},{"hasRating":true,"type":"bizOutlook","value":0.35},{"hasRating":true,"type":"recommend","value":0.4},{"hasRating":true,"type":"compAndBenefits","value":2.4},{"hasRating":true,"type":"cultureAndValues","value":2.5},{"hasRating":true,"type":"careerOpportunities","value":2.5},{"hasRating":true,"type":"workLife","value":2.4},{"hasRating":true,"type":"seniorManagement","value":2.3}],"week":0,"year":0}
Whether this URL can be adapted to obtain information for other employers, I regret I cannot tell you.
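Since that XHR endpoint returns plain JSON, you could fetch it without a browser at all; a sketch using requests (the URL is the one captured above; the User-Agent header is a precaution of mine, not something the answer verified):
import requests

url = ("https://www.glassdoor.ca/api/employer/3581-rating.htm"
       "?locationStr=&jobTitleStr=&filterCurrentEmployee=false"
       "&filterEmploymentStatus=REGULAR&filterEmploymentStatus=PART_TIME")
# Some sites reject clients without a browser-like User-Agent.
headers = {"User-Agent": "Mozilla/5.0"}

data = requests.get(url, headers=headers).json()
for rating in data["ratings"]:
    print(rating["type"], rating["value"])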

Fetching a page which needs user interaction

In Python, I am trying to fetch pages from a specific website.
On this website, there are some parts where the information is not completely accessible in the HTML page and needs a bit of user interaction. To be more precise, there are some reviews, but the long reviews are shortened, and to see the whole review the user must click on a 'More' hyperlink. Is there any way to handle these hyperlinks in Python and fetch the whole reviews in all those cases?
Here is a snapshot of the 'More' hyperlink:
<span class="bla bla" onclick="ta.util.cookie.setPIDCookie(123); ta.call('ta.servlet.Reviews.expandReviews',event,this,'review_331979201', '1', 123);"> More </span>
You could use the Selenium WebDriver API; for an example, see this:
https://www.reddit.com/r/selenium/comments/2lscf4/clicking_a_button_using_selenium_python/
For the complete docs, see http://www.seleniumhq.org/docs/
Use the Selenium Python bindings: http://selenium-python.readthedocs.org/
The algorithm may be as follows (see the sketch below):
If the 'More' hyperlink is not visible in the viewport, scroll to this element
Click the hyperlink
Fetch all reviews
A similar case of scrolling to and clicking on a web element: https://stackoverflow.com/a/34271050/2517622
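A minimal Selenium sketch of that algorithm, matching the span shown in the question by its text (the page URL and the review class name are hypothetical):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/reviews")  # hypothetical reviews page

# Expand every shortened review.
for more in driver.find_elements(By.XPATH, "//span[normalize-space()='More']"):
    # Scroll the link into the viewport before clicking it.
    driver.execute_script("arguments[0].scrollIntoView();", more)
    more.click()

# Fetch all of the now-expanded reviews; the class name is hypothetical.
reviews = [r.text for r in driver.find_elements(By.CSS_SELECTOR, ".review")]
driver.quit()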

Selenium Python: clicking links produced by JSON application

[ Ed: Maybe I'm just asking this? Not sure -- Capture JSON response through Selenium ]
I'm trying to use Selenium (Python) to navigate via hyperlinks to pages in a web database. One page returns a table with hyperlinks that I want Selenium to follow. But the links do not appear in the page's source. The only HTML that corresponds to the table of interest is a <div> tag indicating that the site is pulling results from a facet search. Within that div is a <script type="application/json"> tag and a handful of search options. Nothing else.
Again, I can view the hyperlinks in Firefox, but not using "View Page Source" or Selenium's selenium.webdriver.Firefox().page_source call. Instead, that call outputs not the <script> tag but a series of <div> tags that appear to define the results' format.
Is Selenium unable to navigate output from JSON applications? Or is there another way to capture the output of such applications? Thanks, and apologies for the lack of code/reproducibility.
Try using execute_script() to get the links by running JavaScript, something like:
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")
Note: if the divs are generated by scripts dynamically, you may want to wait a few seconds before executing the script.
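One way to do that wait (my addition, reusing the selector from above) is Selenium's WebDriverWait, which polls until the element exists:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 10 seconds for the dynamically generated div to appear.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div#your-link-to-follow"))
)
driver.execute_script("document.querySelector('div#your-link-to-follow').click();")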
I've confronted a similar situation on a website with JavaScript (http://ledextract.ces.census.gov, to be specific). I had pretty good luck just using Selenium's find_element() methods. The key is that even if not everything about the hyperlinks appears in the page's source, Selenium will usually be able to find them by navigating to the website, since doing that will engage the JavaScript that produces the additional links.
Thus, for example, you could try mousing over the links, finding their titles, and then using:
driver.find_element_by_xpath("//*[@title='Link Title']").click()
Based on whatever title appears by the link when you mouse over it.
Or, you may be able to find the links based on the text that appears on them:
driver.find_element_by_partial_link_text('Link Text').click()
Or, if you have a sense of the id for the links, you could use:
driver.find_element_by_id('Link_ID').click()
If you are at a loss for what the text, title, ID, etc. would be for the links you want, a somewhat blunt approach is to pull the id, text, and title of every element on the page and save them to a file you can look through to identify likely candidates for the links you want. That should show you a lot more (in some respects) than the source code of the site would:
AllElements = driver.find_elements_by_xpath('//*')
for Element in AllElements:
    print('ID = %s TEXT = %s Title = %s' % (Element.get_attribute("id"), Element.get_attribute("text"), Element.get_attribute("title")))
Note: if you have (or suspect you have) a situation where multiple links share the same title, text, etc., then you may want to use the find_elements (plural) methods to get lists of everything satisfying your criteria, specify the XPath more explicitly, and so on.
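A short sketch of that plural lookup (using the same older find_elements_by_* API as the answer; the link text is hypothetical):
# Collect every match instead of just the first, then pick the one you need.
links = driver.find_elements_by_partial_link_text('Link Text')
for link in links:
    print(link.get_attribute("href"))
links[0].click()  # or filter the list by href, position, etc.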
