Node.js scraping with chrome-remote-interface

Node.js scraping with chrome-remote-interface - python

I have been trying to scrape a website protected by Distil Networks,
in which using selenium (with Python) would just always fail.
I did a few searches, and my conclusion is that the site can detect you are using Selenium by using some sort of javascript. I then took a loot at chrome-remote-interface, like it is the thing that I want, but then I got stuck.
What I would like to do is to automate following steps:
Open a Chrome instance
Navigate to a page
Run some javascript
Collect data and save to file
Repeat steps 2 - 4
I know that I can open a instance of Chrome for debugging by:
google-chrome --remote-debugging-port=9222
And I can open a console on node by:
chrome-remote-interface -t 127.0.0.1 -p 9222 inspect -r
I can also run simple scripts like
Page.navigate({url:"https://google.com"})
Runtime.evaluate({expression:"1+1"})
But like I can't get the DOMs directly on Node.js as what I could do on the Chrome Developer Tools console. Basically what I want is run scripts on Node like what I could do on the Chrome Developer Tools console.
Also , there are not enough documentation on chrome-remote-interface for scraping. Is there any good links for that?

I know it's has been asked two years ago, but let me write it here for documentation purposes.
-- Tools of the trade --
I tried the same technique as you did (used the remote debugger for scraping) but instead of using Python i used Node.js because of it's asynchronous nature, thus making easier to work with websockets that the remote debugger relies on.
-- Runtime.evaluate --
One thing i noted is that Runtime.evaluate isn't a valid option for recovering any data if your expression involves asynchronous calls because it returns the result of the calling function and not of the callback function. You have to stick with synchronous expressions.
Example:
Array.from(document.getElementByTagName('tr'))
.map((e)=>e.children[2].innerHTML)
.filter((e)=>e.length>0)
Other thing is that when your expression returns an array Runtime.evaluate just mention that the expression returned an array but not the array itself! (infuriating i know)
I got around it by simply enconding the arrays as JSON strings in the page context then decoding it back to object when it arrives at the Node.js. For example the above expression would need to be:
JSON.stringify(
Array.from(document.getElementByTagName('tr'))
.map((e)=>e.children[2].innerHTML)
.filter((e)=>e.length>0)
)
-- Navigation --
When you trigger a page load by using "Page.navigate", ".click()", ".submit()", "window.location.href=..." or any other way it's important to know when the next page was completely loaded before sending more instructions with Runtime.evaluate.
I did the trick asking the debugger to send me the page loading events(look for the Page.enable method in the documentation) then waiting for the "Page.loadEventFired" event before sending more expressions.

JavaScript expressions evaluated by Runtime.evaluate are executed within the page context just like what happens in the DevTools console.
You can interact with the DOM using the DOM domain, e.g., DOM.getDocument, DOM.querySelector, etc.
Also remember that chrome-remote-interface is mainly a library meaning that it allows you to write your own Node.js applications, the chrome-remote-interface inspect is just an utility.
There are several places where you can get help:
open an issue to chrome-remote-interface;
the chrome-remote-interface wiki;
the Chrome DevTools Protocol Viewer;
the Chrome Debugging Protocol Google Group.
If you ask something more specific I'd be happy to try to help you with that.
Finally you may want to take a look at automated-chrome-profiling, which I think is structurally similar to what you're trying to achieve.

Related

How do I get URLs that are being accessed in my browser in 'real time'?

I want to write a program that returns the current or last visited URL by me on my computer (Windows 10) browser. Is there any way in which I can get that URL?
I tried using Python and SQLite to access Chrome history database on C:\Users%USERNAME%\AppData\Local\Google\Chrome\User Data\Default\History and it worked, but if I'm using the browser, the database gets locked.
I know that by using Wireshark, one can see the packets when accessing an URL, but I cannot find the complete URL in those packets fields, only the server name (ie: stackoverflow.com).
I'd like to know whether there is a way in which I can see that information as it's done by Wireshark, but only to get the complete URL, nothing else. Thank you!

I found a solution to this by using mitmproxy: https://mitmproxy.org/. This video on YouTube helped me with the installation and setup process: https://www.youtube.com/watch?v=7BXsaU42yok. The video explains the installation on Mac, but it's not so different from Windows. Then you can use Python to capture and process the URLs contained within the HTTPS requests by using the flow.request.pretty_url property: https://docs.mitmproxy.org/stable/addons-scripting/.

Selenium - Find out where script was downloaded from

I have to validate that a web application, when executed in the client browser, is fetching some assets (*.js) from a particular remote server.
Say two options exist: whether it gets the script from server A or it gets a copy from server B. I need to assert (based on some preconditions) that the script was downloaded from server A.
The question: Is there a way to inspect source url of loaded javascript using selenium (preferably with python)?

Here it is a possible solution to extract url of javascript libraries from the stackoverflow site.
You should adapt the solution to the site you are working on.
driver = webdriver.Firefox()
driver.get("http://www.http://stackoverflow.com/")
link= driver.find_elements_by_tag_name('script')
for i in link:
print i.get_attribute("src")
Example of output:
http://rules.quantcount.com/rules-p-c1rF4kxgLUzNc.js
http://edge.quantserve.com/quant.js
http://b.scorecardresearch.com/beacon.js
https://www.google-analytics.com/analytics.js
https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js
https://cdn.sstatic.net/Js/stub.en.js?v=9798373e8e81
There are various strategies to locate elements in a page.
You can use the most appropriate one for your case (http://selenium-python.readthedocs.io/locating-elements.html)

find if the xss injection bypassed the filter from python

I'm writing a crawler, (in python 2.7, windows 10) that receive an address of a web page (like https://www.wikipedia.com ) and crawle to find all the pages.
once this is done the program should go on all the pages and try to find an xss vulnerability, my problem is that after injecting the payload I don't know how to verify if the attack succeeded or not (verifying if the payload is in the page isn't good enough because some filter could have changed the payload - without really disabling this. a good example is the payload <scrip<script>t>alert(0)</sc</script>ript> this payload NEEDS the filter to change it in order to work...)
my question is how can I find if an alert (for example) will pop (and if possible without new modules that need to be installed).
thanks!

You can use Selenium and switch_to.alert() to identify the successful invocation of alert.

Emulating a browser to download a file?

There's an FLV file on the web that can be downloaded directly in Chrome. The file is a television program, published by CCTV (China Central Television). CCTV is a non-profit, state-owned broadcaster, financed by the Chinese tax payer, which allows us to download their content without infringing copyrights.
Using wget, I can download the file from a different address, but not from the address that works in Chrome.
This is what I've tried to do:
url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&playtype=1&tk=163659644989925531390490125&brt=2&bc=0&nt=0&du=1496650&ispid=23&rc=200&inf=1&si=11000&npc=1606&pp=0&ul=2&mt=-1&sid=10000&au=0&pc=0&cip=222.73.44.31&hf=0&id=tudou&itemid=135558267&fi=163005294&sz=59138302'
wget -c $url --user-agent="" -O xfgs.f4v
This doesn't work either:
wget -c $url -O xfgs.f4v
The output is:
Connecting to 118.26.57.12:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2013-02-13 09:50:42 ERROR 403: Forbidden.
What am I doing wrong?
I ultimately want to download it with the Python library mechanize. Here is the code I'm using for that:
import mechanize
br = mechanize.Browser()
br = mechanize.Browser()
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
url='http://114.80.235.200/f4v/94/163005294.h264_1.f4v?10000&key=7b9b1155dc632cbab92027511adcb300401443020d&playtype=1&tk=163659644989925531390490125&brt=2&bc=0&nt=0&du=1496650&ispid=23&rc=200&inf=1&si=11000&npc=1606&pp=0&ul=2&mt=-1&sid=10000&au=0&pc=0&cip=222.73.44.31&hf=0&id=tudou&itemid=135558267&fi=163005294&sz=59138302'
r = br.open(url).read()
tofile=open("/tmp/xfgs.f4v","w")
tofile.write(r)
tofile.close()
This is the result:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
return self._mech_open(url, data, timeout=timeout)
File "/usr/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 403: Forbidden
Can anyone explain how to get the mechanize code to work please?

First of all, if you are attempting any kind of scraping (yes this counts as scraping even though you are not necessarily parsing HTML), you have a certain amount of preliminary investigation to perform.
If you don't already have Firefox and Firebug, get them. Then if you don't already have Chrome, get it.
Start up Firefox/Firebug, and Chrome, clear out all of your cookies/etc. Then open up Firebug, and in Chrome open up View->Developer->Developer Tools.
Then load up the main page of the video you are trying to grab. Take notice of any cookies/headers/POST variables/query string variables that are being set when the page loads. You may want to save this info somewhere.
Then try to download the video, once again, take notice of any cookies/headers/post variables/query string variables that are being set when the video is loaded. It is very likely that there was a cookie or POST variable set when you initially loaded the page, that is required to actually pull the video file.
When you write your python, you are going to need to emulate this interaction as closely as possible. Use python-requests. This is probably the simplest URL library available, and unless you run into a wall somehow with it (something it can't do), I would never use anything else. The second I started using python-requests, all of my URL fetching code shrunk by a factor of 5x.
Now, things are probably not going to work the first time you try them. Soooo, you will need to load the main page using python. Print out all of your cookies/headers/POST variables/query string variables, and compare them to what Chrome/Firebug had. Then try loading your video, once again, compare all of these values (that means what YOU sent the server, and what the SERVER sent you back as well). You will need to figure out what is different between them (don't worry, we ALL learned this one in Kindergarten... "one of these things is not like the other") and dissect how that difference is breaking stuff.
If at the end of all of this, you still can't figure it out, then you probably need to look at the HTML for the page that contains the link to the movie. Look for any javascript in the page. Then use Firebug/Chrome Developer Tools to inspect the javascript and see if it is doing some kind of management of your user session. If it is somehow generating tokens (cookies or POST/GET variables) related to video access, you will need to emulate its tokenizing method in python.
Hopefully all of this helps, and doesn't look too scary. The key is you are going to need to be a scientist. Figure out what you know, what you don't, what you want, and start experimenting and recording your results. Eventually a pattern will emerge.
Edit: Clarify steps
Investigate how state is being maintained
Pull initial page with python, grab any state info you need from it
Perform any tokenizing that may be required with that state info
Pull the video using the tokens from steps 2 and 3
If stuff blows up, output your request/response headers,cookies,query vars, post vars, and compare them to Chrome/Firebug
Return to step 1. until you find a solution
Edit:
You may also be getting redirected at either one of these requests (the html page or the file download). You will most likely miss the request/response in Firebug/Chrome if that is happening. The solution would be to use a sniffer like LiveHTTPHeaders, or like has been suggested by other responders, WireShark or Fiddler. Note that Fiddler will do you no good if you are on a Linux or OSX box. It is Windows only and is definitely focused on .NET development... (ugh). Wireshark is very useful but overkill for most problems, and depending on what machine you are running, you may have problems getting it working. So I would suggest LiveHTTPHeaders first.
I love this kind of problem

It seems that mechanize can do stateful browsing, meaning that it will keep context and cookies between browser requests. I would suggest to first load the complete page where the video is located, then do a second try to download the video explicitly. That way, the web server will think that it is a full (legit) browsing session ongoing

you can use selenium or watir to do all the stuff you need in a browser.
since you don't want to see the browser, you can run selenium headless.
see also this answer.

Assuming that you did not type the URL out of the blue by hand, use mechanize to first go to the page where you got that from. Then emulate the action you take to download the actual file (probably clicking a link or a button).
This might not work though as Mechanize keeps state of cookies and redirects, but does not handle any JavaScript real-time changes to the html pages. To check if JavaScript is crucial for the operation, switch of JavaScript in Chrome (or any other browser) and make sure you can download the file. If JavaScript is necessary, I would try and programmatically drive a browser to get the file.
My usual approach to trying this kind of scraping is
try wget or pythons urllib2
try mechanize
drive a browser
Unless there is some captcha, the last one usually works, but the others are easier (and faster).

In order to clarify the "why" part of your question you can route your browser and your code's requests through a debug proxy. If you are using windows I suggest fiddler2. There exist other debug proxies for other platforms as well. But fiddler2 is definitely my favourite.
http://www.fiddler2.com/fiddler2/
https://www.owasp.org/index.php/Category:OWASP_WebScarab_Project
http://www.charlesproxy.com/
Or more low level
http://netcat.sourceforge.net/
http://www.wireshark.org/
Once you know the differences it is usually much simpler to come up with a solution. I suspect that the other answers with regard to stateful browsing / cookies are correct. With the mentioned tools you can analyze these cookies and roll a suitable solution without going for browser automation.

I think many sites use temporary links that only exist in your session. The code in the url is probably something like your session-id. That means the particular link will never work again.
You'll have to reopen the page that contains the link using some library that accomodates this session (like mentioned in other answers). And then try to locate the link and only use it in this session.

While the current accepted answer (by G. Shearer) is the best possible advice for scraping in general, I've found a way to skip a few steps - with a firefox extension called cliget that takes the request context with all the http headers and cookies and generates a curl (or wget) command that is copied to the clipboard.
EDIT: this feature is also available in the network panels of firebug and the chrome debugger - right click request, "copy as curl"
Most of the time you'll get a very verbose command with a few apparently unneeded headers, but you can remove those one by one until the server rejects the request, instead of the opposite (which, honestly, I find frustrating - I often got stuck thinking what header was missing from the request).
(Also, you might want to remove the -O option from the curl commandline to see the result in stdout instead of downloading it to a file, and add -v to see the full header list)
Even if you don't want to use curl/wget, converting one curl/wget commandline to python code is just a matter of knowing how to add headers to an urllib request (or any http request library for that matter)

There's an open source, Python library, named ghost, that wraps a headless, WebKit browser, so you can control everything through a simple API:
from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://my.web.page')
It supports cookies, JavaScript and everything else. You can inject JavaScript into the page, and while it's headless, so it doesn't render anything graphically, you still have the DOM. It's a complete browser.
It wouldn't scale well, but it's lots of fun, and may be useful when you need something approaching a complete browser.

from urllib import urlopen
print urlopen(url) #python built-in high level interface to get ANY online resources, auto responds to HTTP error codes.

Did you try requests module? it's much simpler to use than urllib2 and pycurl etc.
yet it's powerful. it has following features: The link is here
International Domains and URLs
Keep-Alive & Connection Pooling
Sessions with Cookie Persistence
Browser-style SSL Verification
Basic/Digest Authentication
Elegant Key/Value Cookies
Automatic Decompression
Unicode Response Bodies
Multipart File Uploads
Connection Timeouts
.netrc support
Python 2.6—3.3
Thread-safe.

You could use Internet Download Manager it is able to capture and download any streaming media from any website

What is called a library that allows emulation of a browser for automating purposes?

I have an automating task in which I need to fill several forms in a site with data from word documents. For that I would need a library that emulates a browser and allows me to programatically enter a site and access html elements. What is this called? Are there examples of libraries that do this for python or clojure?

You have a couple of choices:
Mechanize
Selenium
There are others too, but I can't remember them off the top of my head right now (will post as and when I remember more)

You may wanna take a look at PhantomJS too:
PhantomJS is a headless WebKit with JavaScript API. It has fast and
native support for various web standards: DOM handling, CSS selector,
JSON, Canvas, and SVG.
If you just want to submit a form, it would probably be easier to forge a request and send it using urllib2.

In nowadays Clojure, http-kit is my favorite. It just makes http interaction very easy.
; taken from github
(defn on-response [resp]
;; {:status 200 :body "....." :headers {:key val :key val}}
(println resp))
;;; initialize, timeout is 40s, and default user-agent
(http/init :timeout 40000 :user-agent "http-kit/1.1")
;;; other params :headers :proxy binary? keyify?
(http/get {:url "http://shenfeng.me" :cb on-response})
;;; other params :headers :proxy binary? keyify?
(http/post {:url "http://example/"
:cb on-response
:body {"name" "http-kit" "author" "shenfeng"} :binary? true})
I have also used CasperJs and it just makes any headless browsing possible. Also, you can interact with the client side javascript while automating the browsing.
The only draw back I found was that it was slightly harder to integrate all this with existing code, but as a standalone tool it was perfect. It also supports both coffescript and javascript scripting.
Look at the Quickstart to get an idea on how it works.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.