How to get visible text from a webpage using Selenium & Python?

I am trying to grab a bunch of numbers that are presented in a table on a web page that I've accessed using Python and Selenium running headless on a Raspberry Pi. The numbers are not in the page source; rather, they are deeply embedded in complex HTML served by several URLs called by the main page (the numbers update every few seconds). I know I could parse that HTML to get the numbers I want, but the numbers are already sitting on the front page in perfect format, all in one place. I can select and copy the numbers when I view the web page in Chrome on my PC.
How can I use Python and Selenium WebDriver to get me those numbers? Can Selenium simply provide all the visible text on a page? How? (I've tried driver.page_source, but the text returned does not contain the numbers.) Or is there a way to essentially copy text and numbers from a table visible on the screen using Python and Selenium? (I've looked into xdotool but didn't find enough documentation to help.) I'm just learning Selenium, so any suggestions will be much appreciated!

Well, I figured out the answer to my question. It's embarrassingly easy. This line gets just what I need - all the text that is visible on the web page:
page_text = driver.find_element_by_tag_name('body').text
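Note that recent Selenium releases (4+) dropped the find_element_by_* helpers, so on current versions the same idea looks like this (a minimal sketch, assuming a headless Chrome setup similar to the one in the question; the URL is a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run without a display, e.g. on a Raspberry Pi
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")  # placeholder URL

# .text returns only the text a user would actually see (hidden elements are skipped)
page_text = driver.find_element(By.TAG_NAME, "body").text
print(page_text)
driver.quit()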

There are a few different reasons why you might not be able to get some info from a page:
The information hasn't loaded yet. You have to wait some time for the information to be ready; sometimes page elements are added dynamically with JS and the like, and those can load very slowly. Look into explicit waits for a better understanding; a minimal example of such a wait follows this list.
The information may be a different type of data than you expect. For example, you are expecting text with numbers, but the page may show a picture with the numbers in it. In that situation you have to change your approach and use other functions to get what you need.
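For the first case, a minimal sketch of such a wait (the URL and the CSS selector are hypothetical placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

# Block for up to 10 seconds until the element shows up in the DOM,
# instead of reading the page before the JS has filled it in.
table = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "table.numbers"))  # hypothetical selector
)
print(table.text)
driver.quit()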

Related

Extracting info from webpage via python

I'd like to ask somebody with experience with headless browsers and Python whether it's possible to extract the box info with the distance from the closest strike on the webpage below. Until now I was using Python with bs4, but since everything here is driven by jQuery, a simple download of the webpage doesn't work. I found PhantomJS, but I wasn't able to extract it with that either, so I am not sure if it's possible. Thanks for any hints.
https://lxapp.weatherbug.net/v2/lxapp_impl.html?lat=49.13688&lon=16.56522&v=1.2.0
This isn't really a Linux question, it's a StackOverflow question, so I won't go into too much detail.
The thing you want to do can be easily done with Selenium. Selenium has both a headless mode and a headful mode (where you can watch it open your browser and click on things). The DOM query API is a bit less extensive than bs4's, but it does have nice visual query functions (location on screen). So you would write a Python script that initializes Selenium, goes to your website, and interacts with it. You may need to do some image recognition on screenshots at some point; that may be as simple as searching for a certain query image on the screen, or something much more complicated.
You'd have to go through the Selenium tutorials first to see how it works, which would take you 1-2 days. Then figure out which Selenium features you can use to do what you want; that depends on luck and on whether what you want happens to be easy or hard for that particular website.
Instead of using Selenium, though, I recommend trying to reverse engineer the API. For example, the page you linked to hits https://cmn-lx.pulse.weatherbug.net/data/lightning/v1/spark with parameters like:
_
callback
isGpsLocation
location
locationtype
safetyMessage
shortMessage
units
verbose
authid
timestamp
hash
You can figure out by trial and error which ones you need and what to put in them. You can capture requests from your browser and then read them yourself. Then construct appropriate requests from a Python program and hit their API. It would save you from having to deal with a Web UI designed for humans.
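A rough sketch of that approach with the requests library; the endpoint is the one observed above, but the parameter values below are illustrative guesses, not known-good values:

import requests

url = "https://cmn-lx.pulse.weatherbug.net/data/lightning/v1/spark"
params = {
    "location": "49.13688,16.56522",      # lat/lon taken from the page URL
    "locationtype": "latitudelongitude",  # hypothetical value
    "units": "metric",                    # hypothetical value
    "verbose": "true",
}

resp = requests.get(url, params=params)
resp.raise_for_status()
# The "callback" parameter in the list above suggests the API may wrap the
# reply as JSONP, so inspect the raw text first rather than assuming clean JSON.
print(resp.text)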

How do I check if a website is responsive using python?

I am using python3 in combination with beautifulsoup.
I want to check whether a website is responsive or not. First I thought of checking the meta tags of a website to see if there is something like this in them:
content="width=device-width, initial-scale=1.0"
Accuracy is not that good with this method, but I have not found anything better.
Does anybody have an idea?
Basically I want to do the same as Google does here: https://search.google.com/test/mobile-friendly reduced to a yes/no output of whether the website is responsive.
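A minimal sketch of the meta-tag heuristic described above, using requests and BeautifulSoup (as noted, a viewport tag is only a weak signal):

import requests
from bs4 import BeautifulSoup

def has_viewport_meta(url):
    # Weak heuristic: does the page declare a responsive viewport?
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"name": "viewport"})
    return bool(tag and "width=device-width" in tag.get("content", ""))

print(has_viewport_meta("https://example.com"))  # placeholder URL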
(Just a suggestion)
I am not an expert on this, but my first thought is that you need to render the website and see if it "responds" to different screen sizes. I would normally use something like PhantomJS to do this.
Apparently, you can do this in Python with Selenium (more info at https://stackoverflow.com/a/15699761/3727050). A more comprehensive list of technologies that can be used for this task can be found here. Note that these resources seem a bit old/outdated, and some solutions fall back to a Python subprocess calling PhantomJS.
The linked Google test seems to load the page in a small browser and check:
that the font size is readable
that the distance between clickable elements is large enough for the page to be usable
I would, however, do the following:
Load the page in desktop mode and record each div's style.
Gradually reduce the size of the screen and see what percentage of the divs change style.
In most cases, going from a large screen down to phone size you should see 1-3 distinct layouts, which should be identifiable from the percentage of elements changing style.
The above does not guarantee that the page is "mobile-friendly" (i.e. usable on a mobile device), but it does show whether the CSS is responsive.
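A rough sketch of that idea with Selenium; the window sizes and the handful of style properties sampled are arbitrary choices, and the URL is a placeholder:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Sample a few layout-relevant computed style properties per element.
JS_STYLE = """
const s = window.getComputedStyle(arguments[0]);
return [s.display, s.width, s.fontSize, s.flexDirection].join('|');
"""

def style_snapshot(driver):
    return [driver.execute_script(JS_STYLE, el)
            for el in driver.find_elements(By.TAG_NAME, "div")]

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

driver.set_window_size(1920, 1080)  # desktop
desktop = style_snapshot(driver)
driver.set_window_size(375, 812)    # phone-sized
phone = style_snapshot(driver)
driver.quit()

changed = sum(d != p for d, p in zip(desktop, phone))
print(f"{changed}/{len(desktop)} divs changed style")  # a high share suggests responsive CSS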

Filling forms on different websites using Selenium and Python

I'm a beginner in Python trying to start my very first project, which revolves around creating a program to automatically fill in pre-defined values in forms on various websites.
Currently, I'm struggling to find a way to identify web elements using the text shown on the website. For example, website A's email field is labelled "Email:" while another website might show "Fill in your email". In such cases, finding the element by ID or name would not be possible (unless I write a different set of code for each website), as they vary from website to website.
So, my question is: is it possible to write code that will scan all the fields -> check the text -> then fill in the values based on the text associated with each field?
It is possible if you know the markup of the page and can write code to parse it. In that case you should use XPath, lxml, Beautiful Soup, Selenium, etc. You can find many tutorials on Google or YouTube; just search for "python scraping".
But if you want to write a program able to understand a random page on a random site and work out what it should do, that is very difficult: it's a complex task that needs machine learning. I'd guess that task is completely not for beginners.
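That said, for the markup-aware case a common trick is to locate a field through the visible text near it, e.g. an XPath that matches the label text and then steps to the next input. A rough sketch, assuming the label precedes the input in document order (it often doesn't, so treat this as a starting point; the URL is a placeholder):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/signup")  # placeholder URL

# Match any label whose text mentions "mail" ("Email:", "Fill in your email", ...),
# lower-casing with translate() so the match is case-insensitive, then take the
# first input element that follows it in document order.
email_field = driver.find_element(
    By.XPATH,
    "//label[contains(translate(., 'EMAIL', 'email'), 'mail')]/following::input[1]"
)
email_field.send_keys("user@example.com")  # one of the pre-defined values
driver.quit()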

How to get dynamically loaded information from a web page with python3?

I want to get some information from a web page. I use requests.get to fetch the page, but I cannot find what I want in it. Looking carefully, I found that the info I want is in a list with a scrollbar: when I drag the scrollbar down, more and more info is loaded. So I guess not all the info in the list is loaded yet when I get the page with the requests module. I want to know what happens in this process and how I can gather the information I want. (I am not familiar with HTML.)
I want to know what happens in this process
It sounds like when the user scrolls, the scrolling triggers some JavaScript (JS), and the JS makes repeated requests to the server for more data. Unfortunately, the requests module cannot execute the JavaScript on an HTML page; all you get back is the text of the JS itself. The inability to execute the JavaScript on an HTML page, and thereby retrieve what the user actually sees, was a problem for a long time, but smart programmers have largely solved it. You need to use a different module: check out the selenium module.
I am not familiar with HTML
Scraping web pages can get really tricky really fast, and some web pages proactively try to prevent programs from scraping their content, so you need to know both HTML and JS in order to figure out what is going on.
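As a starting point, the usual selenium pattern for such a scroll-to-load list is to scroll to the bottom repeatedly until the page height stops growing. A minimal sketch (the URL and the 2-second pause are placeholders):

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scrolling to the bottom triggers the JS that loads the next batch.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the new batch time to arrive
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:  # nothing new loaded, we're done
        break
    last_height = new_height

print(driver.find_element(By.TAG_NAME, "body").text)
driver.quit()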

How to scrape an AJAX website?

In the past, I've used the urllib2 library to get source code from websites. However, I've noticed that for a recent website I've been trying to play with, I can't find the information I need in the source code.
http://www.wgci.com/playlist is the site I've been looking at, and I want to get the most recently played song and the playlist of recent songs. I essentially want to copy and paste the visible, displayed text on the website and put it in a string. Alternatively, being able to access the element that holds these values as plain text and get them using urllib2 normally would be nice. Is there any way to do either of these things?
Thanks kindly.
The website you want to scrape is using AJAX calls to populate its pages with data.
You have two ways of scraping data from it:
Use a headless browser that supports JavaScript (ZombieJS, for instance) and scrape the generated output, but that's complicated and overkill.
Understand how their API works and call it directly, which is way simpler.
Use the Chrome dev tools (Network tab) to see the calls while browsing their website.
For example, the list of last played songs for a given stream is available in JSON at
http://www.wgci.com/services/now_playing.html?streamId=841&limit=12
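Fetching that endpoint directly is then an ordinary HTTP call; a minimal sketch with the requests library (used here instead of urllib2; the structure of the returned JSON is not shown above, so inspect it before picking out fields):

import requests

url = "http://www.wgci.com/services/now_playing.html"
params = {"streamId": 841, "limit": 12}

resp = requests.get(url, params=params)
resp.raise_for_status()
playlist = resp.json()  # the answer above says the data comes back as JSON
print(playlist)         # inspect the structure to pull out the song titles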
