Extracting info from webpage via python

I'd like to ask someone with experience with headless browsers and Python whether it's possible to extract the box showing the distance to the closest strike from the webpage below. Until now I have been using Python with bs4, but since everything here is driven by jQuery, simply downloading the page doesn't work. I found PhantomJS, but I wasn't able to extract the box with that either, so I am not sure whether it's possible. Thanks for any hints.
https://lxapp.weatherbug.net/v2/lxapp_impl.html?lat=49.13688&lon=16.56522&v=1.2.0

This isn't really a Linux question, it's a StackOverflow question, so I won't go into too much detail.
The thing you want to do can be easily done with Selenium. Selenium has both a headless mode and a headed mode (where you can watch it open your browser and click on things). The DOM query API is a bit less extensive than bs4's, but it does have nice visual query (location on screen) functions. So you would write a Python script that initializes Selenium, goes to your website and interacts with it. You may need to do some image recognition on screenshots at some point. It may be as simple as finding a certain query image on the screen, or something much more complicated.
You'd have to go through the Selenium tutorials first to see how it works, which would take you 1-2 days. Then figure out what Selenium stuff you can use to do what you want, that depends on luck and whether what you want happens to be easy or hard for that particular website.
Instead of using Selenium, though, I recommend trying to reverse engineer the API. For example, the page you linked to hits https://cmn-lx.pulse.weatherbug.net/data/lightning/v1/spark with parameters like:
_
callback
isGpsLocation
location
locationtype
safetyMessage
shortMessage
units
verbose
authid
timestamp
hash
You can figure out by trial and error which ones you need and what to put in them. You can capture requests from your browser and then read them yourself. Then construct appropriate requests from a Python program and hit their API. It would save you from having to deal with a Web UI designed for humans.
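As a rough sketch of that approach, the snippet below builds a query URL for the endpoint mentioned above and unwraps a JSONP-style reply. The presence of a callback parameter suggests the endpoint answers with JSONP, but that is an assumption. The parameter values, the helper names build_url and strip_jsonp, and the sample payload are all made up for illustration and would have to be replaced with what you capture in the browser's network tab.

```python
import json
import re
from urllib.parse import urlencode

# Endpoint named in the answer above.
BASE_URL = "https://cmn-lx.pulse.weatherbug.net/data/lightning/v1/spark"


def build_url(params):
    """Return BASE_URL with params encoded as a query string."""
    return BASE_URL + "?" + urlencode(params)


def strip_jsonp(text):
    """Parse 'callbackName({...});' (or plain JSON) into Python objects."""
    match = re.fullmatch(r"\s*[\w$.]+\s*\((.*)\)\s*;?\s*", text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Placeholder values: the real ones (authid, timestamp, hash, ...) must be
# copied from a request captured in the browser.
url = build_url({"location": "49.13688,16.56522", "units": "metric", "callback": "cb"})

# Made-up response body, only to show the unwrapping step.
sample = 'cb({"result": {"distance": 12.5}});'
data = strip_jsonp(sample)
print(data["result"]["distance"])
```

Once the required parameters are known, the URL can be fetched with any HTTP client and the body passed through strip_jsonp in the same way.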

Related

Python - Download Excel file from clickable "XLS" button of website without using Selenium

I am trying to write a program that goes to the following site and downloads the Excel file that automatically downloads when clicking the XLS button on the bottom of the page. To be honest, I am quite new to programming.
Site: https://echa.europa.eu/assessment-regulatory-needs
At first I tried using Selenium to let the program click itself through the browser and really click on the button. However, I think the website detects the usage of automated software and I cannot bypass the disclaimer that appears when opening the website.
Then I read some answers on the same topic where it is possible by using the requests module. However, I could not get it to work.
One thing I think I understood from this answer was that you need to find the site/server the data is requested from by inspecting the button with F12 in the browser. I tried this, and thought I had it, however I cannot get the code to function. I learnt from this answer that you need to give the referer as well, but I think the referer for this file is only partially written out, as it is "https://echa.europa.eu/assessment-regulatory-needs".
This answer explained the network process in more detail, however I am not able to recreate it. Also, to be honest, I do not fully understand what the API is and how to search for it.
I also found this answer, but it does not work for me either.
So I am asking for help on this, as I think that my HTML, Java, Website, Python knowledge is too tiny to see what I have to change to be able to download the excel file.

Python Selenium with Salesforce - Cannot Seem to Access Certain Form Elements

Using Selenium to try and automate a bit of data entry with Salesforce. I have gotten my script to load a webpage, allow me to login, and click an "edit" button.
My next step is to enter data into a field. However, I keep getting an error about the field not being found. I've tried to identify it by XPATH, NAME, and ID and continue to get the error. For reference, my script works with a simple webpage like Google. I have a feeling that clicking the edit button in Salesforce opens either another window or frame (sorry if I'm using the wrong terminology). Things I've tried:
Looking for other frames (can't seem to find any in the HTML)
Having my script wait until the element is present (doesn't seem to work)
Any other options? Thank you!

Salesforce's Lightning Experience (the new white-blue UI) is built with web components that hide their internal implementation details. You'd need to read up a bit about "shadow DOM"; it's not a "happy soup" of HTML and JS all chucked into the top page's HTML. This means that CSS is limited to that one component, and there's no risk of spilling over or overwriting another page area's JS function if you both declare a function with the same name - but it also means it's much harder to get into an element's internals.
You'll have to read up about how Selenium deals with shadow DOM. Some companies claim they have working Lightning UI automated tests. I've heard good things about Provar but haven't used it myself.
For custom UI components, an SF developer has the option to use "light DOM"; for standard UI you'll struggle a bit. If you're looking for some automation without fighting with Lightning Experience (especially since, with 3 releases/year, SF sometimes changes the structure of the generated HTML, breaking old tests), you could consider switching over to the classic UI for the test. It'll be more accessible for Selenium and won't be exactly what the user does, but server-side errors like required fields and validation rules should fire all the same.

BeautifulSoup find returning "None" [duplicate]

What is the best method to scrape a dynamic website where most of the content is generated by what appears to be ajax requests? I have previous experience with a Mechanize, BeautifulSoup, and python combo, but I am up for something new.
--Edit--
For more detail: I'm trying to scrape the CNN primary database. There is a wealth of information there, but there doesn't appear to be an api.

The best solution that I found was to use Firebug to monitor XmlHttpRequests, and then to use a script to resend them.

This is a difficult problem because you either have to reverse engineer the JavaScript on a per-site basis, or implement a JavaScript engine and run the scripts (which has its own difficulties and pitfalls).
It's a heavyweight solution, but I've seen people do this with GreaseMonkey scripts: allow Firefox to render everything and run the JavaScript, and then scrape the elements. You can even initiate user actions on the page if needed.

Selenium IDE, a tool for testing, is something I've used for a lot of screen-scraping. There are a few things it doesn't handle well (Javascript window.alert() and popup windows in general), but it does its work on a page by actually triggering the click events and typing into the text boxes. Because the IDE portion runs in Firefox, you don't have to do all of the management of sessions, etc. as Firefox takes care of it. The IDE records and plays tests back.
It also exports C#, PHP, Java, etc. code to build compiled tests/scrapers that are executed on the Selenium server. I've done that for more than a few of my Selenium scripts, which makes things like storing the scraped data in a database much easier.
Scripts are fairly simple to write and alter, being made up of things like ("clickAndWait","submitButton"). Worth a look given what you're describing.

Adam Davis's advice is solid.
I would additionally suggest that you try to "reverse-engineer" what the JavaScript is doing, and instead of trying to scrape the page, you issue the HTTP requests that the JavaScript is issuing and interpret the results yourself (most likely in JSON format, nice and easy to parse). This strategy could be anything from trivial to a total nightmare, depending on the complexity of the JavaScript.
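A minimal sketch of that idea using only the standard library: build the same request the page's JavaScript would send, then decode the JSON reply. The URL, the header values, and the build_xhr_request / parse_reply names are placeholders, not taken from any real site; copy the actual values from a request captured in your browser's developer tools.

```python
import json
import urllib.request


def build_xhr_request(url, referer):
    """Create a GET request that resembles a page's own AJAX call."""
    headers = {
        "X-Requested-With": "XMLHttpRequest",  # sent by many AJAX libraries
        "Referer": referer,                    # some servers check this
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)


def parse_reply(body):
    """Decode a JSON response body (bytes) into Python objects."""
    return json.loads(body.decode("utf-8"))


# Hypothetical endpoint and referring page.
req = build_xhr_request("https://example.com/api/results?page=1",
                        "https://example.com/results")

# To actually send it (requires network access):
#     with urllib.request.urlopen(req) as resp:
#         data = parse_reply(resp.read())

# Offline demonstration with a made-up response body.
data = parse_reply(b'{"rows": [{"name": "A", "votes": 10}]}')
print(data["rows"][0]["votes"])
```

Whether extra headers such as the referer are needed at all depends on the site; starting with the bare URL and adding headers only when the server refuses is usually the quicker route.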
The best possibility, of course, would be to convince the website's maintainers to implement a developer-friendly API. All the cool kids are doing it these days 8-) Of course, they might not want their data scraped in an automated fashion... in which case you can expect a cat-and-mouse game of making their page increasingly difficult to scrape :-(

There is a bit of a learning curve, but tools like Pamie (Python) or Watir (Ruby) will let you latch into the IE web browser and get at the elements. This turns out to be easier than Mechanize and other HTTP-level tools since you don't have to emulate the browser; you just ask the browser for the HTML elements. And it's going to be way easier than reverse engineering the JavaScript/Ajax calls. If needed you can also use tools like Beautiful Soup in conjunction with Pamie.

Probably the easiest way is to use the IE WebBrowser control in C# (or any other language). You have access to everything inside the browser out of the box, and you don't need to care about cookies, SSL and so on.

I found that the IE WebBrowser control has all kinds of quirks and workarounds, enough to justify some high-quality software layered around the shdocvw.dll API and mshtml that takes care of those inconsistencies and provides a framework.

This seems like a pretty common problem. I wonder why no one has developed a programmatic browser? I'm envisioning a Firefox you can call from the command line with a URL as an argument, and it will load the page, run all of the initial page-load JS events and save the resulting file.
I mean Firefox, and other browsers already do this, why can't we simply strip off the UI stuff?

How to get visible text from a webpage using Selenium & python?

I am trying to grab a bunch of numbers that are presented in a table on a web page that I’ve accessed using Python and Selenium running headless on a Raspberry Pi. The numbers are not in the page source; rather, they are deeply embedded in complex HTML served by several URLs called by the main page (the numbers update every few seconds). I know I could parse the HTML to get the numbers I want, but the numbers are already sitting on the front page in perfect format all in one place. I can select and copy the numbers when I view the web page in Chrome on my PC.
How can I use python and get Selenium webdriver to get me those numbers? Can Selenium simply provide all the visible text on a page? How? (I've tried driver.page_source but the text returned does not contain the numbers). Or is there a way to essentially copy text and numbers from a table visible on the screen using python and Selenium? (I’ve looked into xdotool but didn’t find enough documentation to help). I’m just learning Selenium so any suggestions will be much appreciated!

Well, I figured out the answer to my question. It's embarrassingly easy. This line gets just what I need - all the text that is visible on the web page:
page_text = driver.find_element('tag name', 'body').text  # the older find_element_by_tag_name('body') form was removed in Selenium 4.3

There are a few different reasons why you might not be able to get some information from the page:
The information hasn't loaded yet. You must wait for some time until the information is ready. See this topic for a better understanding. Sometimes page elements are added dynamically with JS and so on, and they load very slowly.
The information may consist of a different type of data than you expect. For example, you are waiting for text with numbers, but the page may show a picture of the numbers instead. In this situation you must change your approach and use other functions to get what you need.

I am trying to scrape this website for all of the documents that are produced from the drop down forms

The site I am trying to scrape has drop-down menus that end up producing a link to a document. The end documents are what I want. I have no experience with web scraping, so I don't know where to start on this. I have tried adapting this to my needs, but I couldn't get it working. I also tried to adapt this.
I know basically I need to:
for state in states:
    select state
    for type in types:
        select type
        select wage_area_radio button
        for area in wage_area:
            select area
            for locality in localities:
                select locality
                for date in dates:
                    select date
                    get_document
I just haven't found anything that works for me yet. Is there a tool better than Selenium for this? I am currently trying to bend it to my will using the code from my second example as a starter.

Depending on your coding skills and knowledge of HTTP, I would try one of two things. Note that scraping this site appears slightly non-trivial because of the different form options that appear based on what was previously selected, and the fact that there are a lot of AJAX calls happening.
1) Follow the HTTP requests (especially the AJAX ones) that are being made in something like Chrome DevTools. You'll get a good understanding of how the final URL is being formed and how to construct it yourself. In particular, it looks like the last POST to AFWageScheduleYearSelected is the one that generates the final url. Then, you can make these calls yourself in a Python HTTP library to get the documents.
2) Use something like PhantomJS (http://phantomjs.org/) which is a headless browser. I don't have experience scraping with Selenium, but my understanding is that it is more of a testing/automation tool. In any case, PhantomJS is pretty easy to get up and running and you can basically click page elements, fill out forms, etc.
If you do end up using PhantomJS (or any other browser-like tool), you'll run into issues with the AJAX calls that populate the forms. Basically, you'll end up trying to fill out forms that don't yet exist on the page because the data is still being sent over the network. The easiest way to get around this is to just set timeouts (of say 2 seconds) in between each form field that you fill out. The alternative to using timeouts (which may be unreliable and slow) is to continuously poll the page until the AJAX call is finished.
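The polling alternative can be sketched as a small helper that repeatedly checks a condition until it holds or a deadline passes. wait_until and the fake options_loaded check below are illustrative names, not part of PhantomJS or Selenium; in a real script the condition would ask the browser whether the form field has been populated yet.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Call condition() until it returns a truthy value or timeout expires.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Stand-in for "has the AJAX call filled the drop-down yet?":
# pretends the options appear on the third check.
calls = {"count": 0}


def options_loaded():
    calls["count"] += 1
    return ["2019", "2020"] if calls["count"] >= 3 else []


options = wait_until(options_loaded, timeout=2.0, interval=0.01)
print(options)
```

Compared with a fixed 2-second pause, this returns as soon as the data arrives and fails loudly if it never does. Selenium ships a comparable built-in mechanism in its WebDriverWait class.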
